mushroomrl / mushroom-rl

Python library for Reinforcement Learning.

License: MIT License

Python 99.98% Makefile 0.02%

atari ddpg deep-learning deep-reinforcement-learning dqn mujoco openai-gym pybullet pytorch qlearning reinforcement-learning rl sac trpo

mushroom-rl's Issues

cannot import name 'WeightedFQI'

In [3]: import mushroom.algorithms.value
/home/tyrion/.local/lib/virtualenvwrapper/fungo/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
  return f(*args, **kwds)
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-3-dae6f65bc28c> in <module>()
----> 1 import mushroom.algorithms.value

~/src/mushroom/mushroom/algorithms/value/__init__.py in <module>()
----> 1 from .batch_td import FQI, DoubleFQI, WeightedFQI, LSPI
      2 from .dqn import DQN, DoubleDQN, AveragedDQN
      3 from .td import QLearning, DoubleQLearning, WeightedQLearning, SpeedyQLearning,\
      4     RLearning, RQLearning, SARSA, SARSALambdaDiscrete, SARSALambdaContinuous,\
      5     ExpectedSARSA, TrueOnlineSARSALambda

ImportError: cannot import name 'WeightedFQI'

suspected memory leak

Describe the bug
I run a simple DQN on the Atari game Breakout, and memory usage slowly increases: after 20-30 epochs it takes 64GB, and it keeps growing after that. I use a replay memory of 1 million samples, but I thought that after 4 epochs of 250k iterations the replay memory should already be full, so the RAM in use shouldn't increase any further. Am I right?
I'm training on CPU, but I guess this shouldn't influence the memory leak.

System information (please complete the following information):

OS: Ubuntu
Python version 3.8.10
Torch version 1.13.0+cu117
Mushroom version 1.9.0 and 1.7.2

can not import

Hi, I cannot import the package. The error is as follows:
File "mushroom_rl\environments\mujoco_envs\humanoid_gait\_external_simulation\muscle_simulation_stepupdate.pyx", line 1, in init mushroom_rl.environments.mujoco_envs.humanoid_gait._external_simulation.muscle_simulation_stepupdate ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

how to reproduce DQN nature paper?

I'm trying to reproduce the DQN results on Breakout with mushroom, but I get much lower average rewards.
I expect to reach at least 300, as in the Nature paper on DQN, but I reach 175 after 100 epochs.
I started from the example code, where the parameters are already close to those of the Nature paper. I only increased the replay memory to 1M and tested different optimizers, but with no luck.
Do you have any idea what I'm missing?

Mujoco 200 Dynamic Library Error If Configured with mushroom_rl

Describe the bug

While configuring dm_control with mushroom_rl, I am having difficulty configuring mujoco200. The error is:

AttributeError: /.mujoco/mujoco200/bin/libmujoco200.so: undefined symbol: mjr_label

The following environment variables are set:

export MUJOCO_GL=osmesa
export MJLIB_PATH="/.mujoco/mujoco200/bin/libmujoco200.so"
export MJKEY_PATH="/.mujoco/mujoco200/mjkey.txt"
export MUJOCO_PY_MJPRO_PATH="/.mujoco/mujoco200/"
export MUJOCO_PY_MJKEY_PATH="/.mujoco/mujoco200/mjkey.txt"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/.mujoco/mujoco200/bin"

I have been stuck at this point for a long time. Kindly help.

To Reproduce
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from mushroom_rl.algorithms.value import DQN

Expected behavior
No error

System information (please complete the following information):

OS: [e.g. Ubuntu 20.04]
Python version [Python3.7]
Torch version [Pytorch 1.10]
Mushroom version [1.70]

Additional Context
mujoco_py works perfectly fine as it can render environments from python code.

Inconsistent Parameter Order for Agents

Describe the bug
The order of constructor parameters in the agents is not consistent.

Examples:
Agent(self, policy, mdp_info, features=None)
DeepAC(self, policy, mdp_info, actor_optimizer, parameters)
A2C(self, mdp_info, policy, critic_params, **alg_params)
BlackBoxOptimization(self, distribution, policy, mdp_info, features=None)

Expected behavior
I would expect parameters to have the same order and position as in the superclass, wherever possible.

Examples:
Agent(self, policy, mdp_info, features=None)
DeepAC(self, policy, mdp_info, actor_optimizer, parameters)
A2C(self, policy, mdp_info, critic_params, **alg_params)
BlackBoxOptimization(self, policy, mdp_info, distribution, features=None)

System information (please complete the following information):

Mushroom version 1.2.0

Suggestion: Add median to compute_metrics

Two common performance metrics used in the RL literature are mean return and median return; the median is reported because it is less influenced by outliers. However, compute_metrics() does not compute the median. It would be straightforward to change the current line

return np.min(J), np.max(J), np.mean(J), len(J)

to

return np.min(J), np.max(J), np.mean(J), len(J), np.median(J)
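
A minimal sketch of the proposed change, assuming J is the list of per-episode returns that compute_metrics() already builds. The name summarize_returns is a hypothetical stand-alone helper, not the library function.

```python
import numpy as np


def summarize_returns(J):
    """Return (min, max, mean, n_episodes, median) for a list of episode returns.

    Hypothetical helper: same tuple as in the issue, with the median appended.
    """
    J = np.asarray(J, dtype=float)
    if J.size == 0:
        raise ValueError("at least one completed episode is required")
    return np.min(J), np.max(J), np.mean(J), len(J), np.median(J)


# One outlier moves the mean much more than the median.
stats = summarize_returns([1.0, 2.0, 3.0, 4.0, 100.0])
```

Note that callers which unpack exactly four values from the result would need to be updated if a fifth element is appended.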

Question about the RBFs

The comments in gaussian_rbf.py say

Factory method to build uniformly spaced Gaussian radial basis functions
        with a 25\% overlap.

What is meant by the "25% overlap" here? For example,
what percentage would these 2 Gaussians overlap with?

A 100% overlap can be thought of as 2 Gaussians with the same parameters. So, is the overlap percentage 100 times the ratio of the common area of the 2 Gaussians to the area of one of them (since the width parameter is the same for both)? If this isn't true, then please let me know what the formal definition of the "overlap" is.
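
One way to make the area-ratio definition proposed above concrete, offered as a sketch and not as a statement of what gaussian_rbf.py implements: for two unit-area Gaussians with the same standard deviation sigma whose centers are d apart, the common area is erfc(d / (2 * sqrt(2) * sigma)). The name overlap_fraction is a hypothetical helper.

```python
import math


def overlap_fraction(distance, sigma):
    """Common area of two unit-area Gaussians with equal sigma, centers `distance` apart."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # The curves cross midway between the centers; each tail beyond that
    # midpoint has area Phi(-d / (2 sigma)), and there are two such tails.
    return math.erfc(abs(distance) / (2.0 * math.sqrt(2.0) * sigma))


same_center = overlap_fraction(0.0, 1.0)      # identical Gaussians
two_sigma_apart = overlap_fraction(2.0, 1.0)  # centers two sigma apart
```

Under this definition, identical Gaussians overlap by 100% and centers two sigma apart give roughly 32%. Whether the 25% in the docstring refers to this area ratio or to some other quantity remains the open question.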

PPO very different performance compared to StableBaselines3

Hi,

I have a question regarding the performance of PPO in MushroomRL compared to StableBaselines3.

Is it safe to say that your implementation of PPO should achieve a similar performance (in terms of mean discounted reward) compared to the PPO implementation of StableBaselines3?

By looking at the code it seems sensible to compare the two performances, by setting some hyper-parameters of PPO from StableBaselines3 in order to be able to compare it with your implementation.

However when running some experiments I came across something different:

Running experiments on MushroomRL InvertedPendulum environment, your implementation achieves superior performance (in terms of mean discounted reward).
Running experiments on an LQG environment, StableBaselines3 reaches the theoretical optimal policy whereas your implementation is extremely far off.

In both cases I used the same number of steps, the same hyper-parameters, the same network architecture, the same policy, the same environment and the same evaluation method.

Are you surprised by this? If not, why?

Thank you in advance

Could someone show me an example of DQN but using an RNN?

I'm working on a problem with small time-series data. I wrote an environment myself in mushroom.

I wanted to see if DQN where the Q-network is a GRU would help. But I keep running into problems, mainly this:

RuntimeError: Expected object of scalar type Double but got scalar type Float for argument #3 'mat2' in call to _th_addmm_out
I tried agents.approximator.model.network = agents.approximator.model.network.double() and converting the data I use with double() at every step, but it still won't work. I'm wondering whether things get cast to floats somewhere along the line in mushroom?

Please let me know if anyone has any insight!
I have also attached my network. I know it looks gross, but I have been trying to track down the problem.

`class GRUNetwork(nn.Module):
    n_features = 64

    def __init__(self, input_shape, output_shape, **kwargs):
        super().__init__()

        n_input = input_shape[0]
        n_output = output_shape[0]

        self._gru = nn.GRU(input_size=n_input, hidden_size=self.n_features, batch_first=True)
        self._h1 = nn.Linear(self.n_features, 32)
        self._h2 = nn.Linear(32, 16)
        self._h3 = nn.Linear(16, n_output)

        nn.init.xavier_uniform_(self._h1.weight,
                                gain=nn.init.calculate_gain('relu'))
        nn.init.xavier_uniform_(self._h2.weight,
                                gain=nn.init.calculate_gain('relu'))
        nn.init.xavier_uniform_(self._h3.weight,
                                gain=nn.init.calculate_gain('linear'))

    def forward(self, state, action=None):
        state = state.type(torch.DoubleTensor)
        #print( state.dtype, state.shape, state.dim() )
        if state.dim() == 2:
            state = state.unsqueeze(1)
        if state.shape[0]==256:
            print(state[0].dtype)
            print(state[0][0].dtype)
            print(state[0])
        self._gru = self._gru.double()
        h,_ = self._gru(state.type(torch.DoubleTensor))
        h = F.relu(h.double())
        h = F.relu(self._h1(h.double()))
        h = F.relu(self._h2(h.double()))
        q = self._h3(h.double())

        if action is None:
            return q
        else:
            q_acted = torch.squeeze(q.gather(1, action.long()))

            return q_acted`

Can support multi-agent env and algorithms?

Does mushroom-rl support multi-agent environments and algorithms, or will support be added in the future?

Is there a way to do a quick Atari benchmark test with each model?

I see an example script for DQN, but I'd like to benchmark Atari with other models including my own custom ones. Is there an example of how I can use mushroom-rl to do this?

Question: How is reward defined for Atari Pong?

Using the code in mushroom-rl/docs/source/tutorials/code/dqn.py I trained a network to play Atari Pong. After training, I ran 1000 episodes of one game each to check the winrate of the network versus the Atari computer player. For about a quarter of the episodes the reward reported by core.evaluate was zero. I wonder what this means and if it is an intended result.

Pong is a two-player game. Each player scores points, and a game ends when one player reaches a score of 21. The game cannot end in a tie. A natural definition of the reward for a game would be score(network) - score(Atari player). This cannot be zero. What is the intended reward if it is not the natural definition?

QLearning Can't Train On Episodes

If someone tries to train QLearning on more than a single step, an assertion fails because the dataset contains more than one sample.

    def fit(self, dataset):
        assert len(dataset) == 1

        state, action, reward, next_state, absorbing = self._parse(dataset)
        self._update(state, action, reward, next_state, absorbing)

The only way I've found to work around this is to train QLearning every step. But this shouldn't need to be the case; QLearning can also be trained at the end of an episode, no problem.
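
To illustrate the point that nothing in the algorithm itself requires per-step fitting, here is a sketch of a tabular update applied to every transition of a finished episode, in order. It assumes transitions are (state, action, reward, next_state, absorbing) tuples as in the fit() snippet above. q_learning_on_episode is a hypothetical helper and does not use the library's classes.

```python
from collections import defaultdict


def q_learning_on_episode(Q, episode, n_actions, alpha=0.5, gamma=0.9):
    """Apply the one-step Q-learning update to each transition, in order."""
    for state, action, reward, next_state, absorbing in episode:
        # No bootstrapping from an absorbing (terminal) state.
        q_next = 0.0 if absorbing else max(Q[(next_state, a)] for a in range(n_actions))
        td_error = reward + gamma * q_next - Q[(state, action)]
        Q[(state, action)] += alpha * td_error
    return Q


Q = defaultdict(float)  # Q[(state, action)], initialised to 0
episode = [(0, 1, 0.0, 1, False), (1, 1, 1.0, 2, True)]
q_learning_on_episode(Q, episode, n_actions=2)
```

The result is not identical to fitting after every step when the behaviour policy depends on Q, because the actions in the episode were chosen before any of these updates were applied.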

Trying to solve the simple chain environment using LSPI

Describe the bug

Hoping to get the features {I(a=L), I(a=L)s, I(a=L)s^2, I(a=R), I(a=R)s, I(a=R)s^2}, I used:

basis = PolynomialBasis.generate(max_degree=2, input_size=2)
features = Features(basis_list=basis)

However, I got the following error:

  File "examples/simplechain_lspi.py", line 77, in <module>
    steps = experiment()
  File "examples/simplechain_lspi.py", line 57, in experiment
    core.evaluate(n_episodes=2, render=True)
  File "/Users/kishanpb/Documents/Github/mushroom-rl/mushroom_rl/core/core.py", line 94, in evaluate
    initial_states)
  File "/Users/kishanpb/Documents/Github/mushroom-rl/mushroom_rl/core/core.py", line 123, in _run
    episodes_progress_bar, render, initial_states)
  File "/Users/kishanpb/Documents/Github/mushroom-rl/mushroom_rl/core/core.py", line 139, in _run_impl
    sample = self._step(render)
  File "/Users/kishanpb/Documents/Github/mushroom-rl/mushroom_rl/core/core.py", line 185, in _step
    action = self.agent.draw_action(self._state)
  File "/Users/kishanpb/Documents/Github/mushroom-rl/mushroom_rl/algorithms/agent.py", line 48, in draw_action
    state = self.phi(state)
  File "/Users/kishanpb/Documents/Github/mushroom-rl/mushroom_rl/features/_implementations/basis_features.py", line 21, in __call__
    out[i] = bf(s)
  File "/Users/kishanpb/Documents/Github/mushroom-rl/mushroom_rl/features/basis/polynomial.py", line 41, in __call__
    out *= x[i]**d
IndexError: index 1 is out of bounds for axis 0 with size 1

To Reproduce
I used the cartpole_lspi.py code but with the environment:
```
mdp = generate_simple_chain(state_n=5, goal_states=[2], prob=.8, rew=1,
                            gamma=.9)
```

[simplechain_lspi.py.zip](https://github.com/MushroomRL/mushroom-rl/files/4340862/simplechain_lspi.py.zip)

System information (please complete the following information):

OS: Mac 10.13.6
Python version: Python3.7
Torch version: Pytorch 1.3
Mushroom version: master

compress frames

Is your feature request related to a problem? Please describe.
I would like to reduce the memory taken by RL on Atari, so that I can run many experiments at the same time.

Describe the solution you'd like
Compress the frames; for example, rllib uses LZ4 compression. This is different from lazy frames.

If I want to implement this myself, where should I make the change?
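
As an illustration of the idea only, here is a sketch of compressing a frame before storing it and decompressing it on sampling. It uses zlib from the standard library as a stand-in for LZ4, which is an assumption made to keep the example dependency-free. compress_frame and decompress_frame are hypothetical helpers, and where they would hook into the replay memory is not shown.

```python
import zlib


def compress_frame(frame_bytes, level=1):
    """Compress one raw frame; a low level trades compression ratio for speed."""
    return zlib.compress(frame_bytes, level)


def decompress_frame(blob):
    """Recover the raw frame bytes."""
    return zlib.decompress(blob)


# A synthetic 84x84 single-channel frame that is entirely background.
frame = bytes(84 * 84)
blob = compress_frame(frame)
restored = decompress_frame(blob)
```

Real frames compress less well than this synthetic one, and every sampled batch pays the decompression cost, so the trade-off would need to be measured.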

n_steps dqn performs worse. bug?

Describe the bug
I modified DQN to enable n-step DQN, but I get worse results. Am I missing something?

To Reproduce
Use DQN with this function, defining self.n_steps in __init__:

def _fit_standard(self, dataset):
       self._replay_memory.add(dataset, n_steps_return=self.n_steps, gamma=self.mdp_info.gamma)
       if self._replay_memory.initialized:
           state, action, reward, next_state, absorbing, _ = \
               self._replay_memory.get(self._batch_size())

           if self._clip_reward:
               reward = np.clip(reward, -1, 1)

           q_next = self._next_q(next_state, absorbing)
           gamma = self.mdp_info.gamma ** self.n_steps * (1 - absorbing)
           q = reward + gamma * q_next

           self.approximator.fit(state, action, q, **self._fit_params)

Expected behavior
DQN with 2 or 3 steps is worse than 1-step DQN on Atari Breakout and Lunar Lander. I'm not sure whether this is a bug or whether it's supposed to be worse. In any case, it would be nice to have n-step DQN implemented in mushroom_rl.

System information (please complete the following information):

Mushroom version 1.9

thanks
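
For reference, a sketch of the n-step target written out for a single window of rewards, independent of the replay memory. n_step_target is a hypothetical helper. The point it illustrates is that the bootstrap discount should use the number of rewards actually accumulated, which can be smaller than n_steps when the episode ends inside the window.

```python
def n_step_target(rewards, gamma, q_bootstrap, absorbing):
    """Target for a window of k = len(rewards) consecutive rewards.

    target = sum_i gamma**i * r_i + gamma**k * q_bootstrap
    The bootstrap term is dropped if the window ends in an absorbing state.
    """
    k = len(rewards)
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
    bootstrap = 0.0 if absorbing else gamma ** k * q_bootstrap
    return discounted + bootstrap


full_window = n_step_target([1.0, 1.0, 1.0], gamma=0.5, q_bootstrap=8.0, absorbing=False)
short_window = n_step_target([1.0, 1.0], gamma=0.5, q_bootstrap=8.0, absorbing=True)
```

Two things may be worth checking against this in the snippet above, both stated as possibilities and not as diagnoses: whether gamma ** self.n_steps is also applied to windows shorter than n_steps, and whether np.clip acts on an already summed n-step reward instead of on each per-step reward.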

Segway environment not loaded in init

Hi,

The Segway environment is not loaded in mushroom_rl/environments/__init__.py, which is confusing when you try to import it. Since it is the only one missing I assume you simply forgot to put it there?

Best,
Robert

Question: TorchApproximator.predict - Why no torch.no_grad() and why call forward directly?

Is your feature request related to a problem? Please describe.
I was wondering why the predict method in class TorchApproximator calculates the gradients and calls self.network.forward(*torch_args, **kwargs).

Describe the solution you'd like
Why not use a with torch.no_grad() block, to save memory on the one hand and to make the detach() call unnecessary on the other? Further, calling self.network(*torch_args, **kwargs) instead of forward is better practice (if there's no good reason for doing otherwise).

Describe alternatives you've considered
If it is the desired behaviour, that the output_tensor flag means a tensor with gradients should be returned, the first point is void.

Additional context
Additionally, in line 254 in the same file, the .requires_grad_(False) has no effect since the .detach() call has already taken care of it.

Question: Does LSPI work for any environment other than Mushroom Cartpole?

I was trying to make LSPI work for Acrobot, but it isn't working. I tried basis functions such as:

# basis 1
basis = [PolynomialBasis()]

s1 = np.array([-np.pi, 0, np.pi]) * .25
s2 = np.array([-2*np.pi, 0, 2*np.pi])
s3 = np.array([-4*np.pi, 0, 4*np.pi])
s = np.array(np.meshgrid(s1,s1,s1,s1,s2,s3)).T.reshape(-1,6)

for i in s:
    basis.append(GaussianRBF(i, np.array([2.])))

# basis 2
basis = GaussianRBF.generate(n_centers=[3,3,3,3,3,3],
                             low=[-1,-1,-1,-1,-4*np.pi,-9*np.pi],
                             high=[1,1,1,1,4*np.pi,9*np.pi])  #, dimensions=[1,1,1,1])
basis.append(PolynomialBasis())

# basis 3
basis = PolynomialBasis.generate(max_degree=4, input_size=6)

Does anyone have a working basis function for Acrobot or any other environment?

How to train an agent in one environment and use it in another, slightly different environment

I trained a QLearning agent in one environment and want to use that same trained agent in another, slightly different environment. How can I do that?

Can't install package

Can't install the package

./python.exe -m pip install mushroom-rl

System information:

OS: Windows 10 (21H2, 19044.2006)
Python version: 3.9 & 3.10
Torch version: 1.12.1
Mushroom version: 1.6.1, 1.7.2

Logs

Building wheel for mushroom_rl (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for mushroom_rl (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [344 lines of output]

...

LINK : fatal error LNK1104: cannot open file 'build\temp.win-amd64-cpython-310\Release\mushroom_rl/environments/mujoco_envs/humanoid_gait/_external_simulation\muscle_simulation_stepupdate.cp310-win_amd64.exp'
      error: command 'C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.33.31629\\bin\\HostX86\\x64\\link.exe' failed with exit code 1104
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for mushroom_rl
Failed to build mushroom_rl
ERROR: Could not build wheels for mushroom_rl, which is required to install pyproject.toml-based projects

Conjugate Gradient Method in TRPO

In my experiments, the conjugate gradient method in TRPO does not seem to work quite right.
The residual decreases very slowly, if at all, and is not even close to the tolerance (1e-2 instead of 1e-10) after the default number of iterations (10). Thus, the CG method never stops early and never returns an exact solution.

Furthermore, I checked the progress on the objective function of the related minimization problem f(x) = 1/2 * x.T * A * x - b.T * x
Interestingly, the values seem to increase instead of decreasing monotonically.

I added the following code at line 158 of trpo.py to monitor the progress:
Fx = self._fisher_vector_product(x, obs, old_pol_dist).detach().cpu().numpy()
print(0.5*x.dot(Fx) - b.detach().cpu().numpy().dot(x))

After some debugging, I think the problem lies in the Fisher-vector product. The general CG method worked in examples where I replaced the FVP with an ordinary matrix-vector multiplication.

I am not sure whether this problem has any effect on the overall performance of TRPO. Nevertheless, the result of the conjugate gradient defines the search direction of the line search and is therefore a central part of the optimization.
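
As a reference point for the monitoring described above, here is a sketch of textbook conjugate gradient that records f(x) after every step. It is not the code in trpo.py, and the 2x2 matrix is a made-up symmetric positive definite example. With such an operator the recorded objective should never increase.

```python
import numpy as np


def conjugate_gradient(matvec, b, n_iters=10, tol=1e-10):
    """Solve A x = b for symmetric positive definite A, given only v -> A v."""
    x = np.zeros_like(b)
    r = b.copy()          # residual b - A x, with x = 0
    p = r.copy()          # first search direction
    rs_old = r.dot(r)
    objective = []        # f(x) = 0.5 x^T A x - b^T x after each step
    for _ in range(n_iters):
        Ap = matvec(p)
        alpha = rs_old / p.dot(Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        objective.append(0.5 * x.dot(matvec(x)) - b.dot(x))
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x, objective


A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
b = np.array([1.0, 2.0])
x, objective = conjugate_gradient(lambda v: A.dot(v), b)
```

If the same loop shows an increasing objective when matvec is the Fisher-vector product, that would suggest the product is not behaving like a fixed symmetric positive definite linear map, which is consistent with the suspicion raised in the issue.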

[solvers/dynamic_programming] Use np.linalg.solve instead of np.linalg.inv

First of all, thank you for setting up such an amazing and easy-to-use library. While working with mushroom_rl on finite MDPs, I came across the solving step inside the policy_iteration function located in ./solvers/dynamic_programming.py.

Line 76 states

value = np.linalg.inv(i - gamma * p_pi).dot(r_pi)

Here the computation can be easily substituted by just solving the corresponding linear system via

value = np.linalg.solve(i - gamma * p_pi, r_pi)

This version should be considerably faster (np.linalg.inv() internally solves the systems $Ax = e^{i}$ for $i=1,\ldots, n$) and mathematically more convenient (sometimes strange things happen when applying np.linalg.inv() to large dense matrices).

Regards,
Florian
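
A small check that the two forms agree, using a made-up 3-state chain under a fixed policy. The transition matrix and rewards are illustrative assumptions, and the variable names follow the line quoted in the issue.

```python
import numpy as np

gamma = 0.9
# Hypothetical 3-state chain: row-stochastic P_pi and expected rewards r_pi.
p_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
r_pi = np.array([0.0, 0.0, 1.0])
i = np.eye(3)

value_inv = np.linalg.inv(i - gamma * p_pi).dot(r_pi)    # current form
value_solve = np.linalg.solve(i - gamma * p_pi, r_pi)    # proposed form
```

The last state loops on itself with reward 1, so its value is 1 / (1 - gamma) = 10, which both forms reproduce up to floating-point error.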

[requirements.txt] Missing requirement for OpenAI gym

Problem:
While trying to run the example montain_car_sarsa.py I got the error TypeError: 'NoneType' object is not callable. By inspecting the ./environment/__init__.py file, it was clear that openai/gym was missing.

Solution:

pip install gym

solved the issue. Adding the line gym to requirements.txt should solve this by default.

Discretization of the state space

Is there an option to discretize the state space of an environment included in this repo?
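
Independently of what the library offers, a uniform-grid discretization can be sketched in a few lines. discretize is a hypothetical helper, and the bounds and bin counts below are made-up values.

```python
import numpy as np


def discretize(state, low, high, n_bins):
    """Map a continuous state to a single integer index on a uniform grid."""
    state, low, high = (np.asarray(v, dtype=float) for v in (state, low, high))
    bins = np.asarray(n_bins, dtype=int)
    # Scale each dimension to [0, 1], turn it into a bin index, and clip
    # values that fall on or outside the bounds into the first or last bin.
    ratios = (state - low) / (high - low)
    idx = np.clip((ratios * bins).astype(int), 0, bins - 1)
    return int(np.ravel_multi_index(tuple(int(k) for k in idx),
                                    tuple(int(n) for n in bins)))


index = discretize([0.25, -0.5], low=[0.0, -1.0], high=[1.0, 1.0], n_bins=[4, 2])
```

The returned integer can then be used as the state of a tabular agent. The grid size grows as the product of the bin counts, which limits this approach to low-dimensional state spaces.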

dynaq agent

Dyna-Q is a conceptual algorithm that illustrates how real and simulated experience can be combined in building a policy. Planning, in RL terminology, refers to using simulated experience generated by a model to find or improve a policy for interacting with the modeled environment.

Any plans on having this agent in mushroom-rl?

Question: Can I create a completely custom environment?

Hi.

I want to create a totally new vectorized environment that is not provided by any library.
I tried to create it using the MushroomRL library.

However, although I read and followed the MushroomRL documentation, I couldn't create my custom environment.

URL I referenced:
https://mushroomrl.readthedocs.io/en/latest/source/tutorials/tutorials.5_environments.html#creating-a-new-environment

I thought I could create a custom environment with the MushroomRL library.
Is there anything I'm missing or misunderstood?

Tutorial for REINFORCE

I'm trying to implement a simple REINFORCE agent on Gridworld. However, I keep hitting the following error:

  File "/home/rylan/Documents/GanguliGang-Metacognitive-Actor-Critic/mac_venv/lib/python3.6/site-packages/mushroom_rl/core/core.py", line 141, in _run_impl
    sample = self._step(render)
  File "/home/rylan/Documents/GanguliGang-Metacognitive-Actor-Critic/mac_venv/lib/python3.6/site-packages/mushroom_rl/core/core.py", line 188, in _step
    action = self.agent.draw_action(self._state)
  File "/home/rylan/Documents/GanguliGang-Metacognitive-Actor-Critic/mac_venv/lib/python3.6/site-packages/mushroom_rl/core/agent.py", line 65, in draw_action
    return self.policy.draw_action(state)
  File "/home/rylan/Documents/GanguliGang-Metacognitive-Actor-Critic/mac_venv/lib/python3.6/site-packages/mushroom_rl/policy/td_policy.py", line 149, in draw_action
    return np.array([np.random.choice(self._approximator.n_actions,
AttributeError: 'NoneType' object has no attribute 'n_actions'

It appears that the policy needs to be initialized with an approximator. I would really appreciate a simple tutorial showing how to create an approximator and a policy on a simple environment.

Thanks in advance!

Please add hyper-parameter tuning options?

Is your feature request related to a problem? Please describe.
Stable Baselines 3 supports hyperparameter tuning with Optuna. It would be really helpful if this were supported in Mushroom-RL. It's a bit difficult now since all the training logic is hidden in core.learn().

Continuous control from pixels?

Hi, this library is great! Is it compatible with continuous control tasks from pixels?

Does MushroomRL support environment parallelization?

I would like to run MuJoCo environments with MushroomRL. However, on my computer, running the PPO algorithm for 1M steps in HalfCheetah takes more than 1 hour. I think environment parallelization could greatly reduce the required time, but I haven't figured out how to use vector environments. Any suggestions?

support for new spaces

Can you implement new space types like those in Gymnasium, or give me some guidance on implementing them?

There are some important spaces, such as Dict and Tuple, that I need.

Thanks in advance

I save an agent with LinearParameter epsilon, when I load it, the epsilon is a Parameter

Describe the bug
If I save an agent (even with full save) where I have a LinearParameter as epsilon for EpsGreedy, and I load the agent, the epsilon is a Parameter and not a LinearParameter, so I cannot continue with the same EpsGreedy policy.
My goal is to save an agent so I can resume the training.

To Reproduce

epsilon = LinearParameter(value=args.initial_exploration_rate,
                                  threshold_value=args.final_exploration_rate,
                                  n=args.final_exploration_frame)
pi = EpsGreedy(epsilon=epsilon)
[...]
agent = alg(mdp.info, pi, approximator,
                        approximator_params=approximator_params,
                        **algorithm_params)
agent.save('agent.msh', full_save=True)
[...]
agent = DQN.load('agent.msh')
#agent.policy._epsilon is a Parameter and not a LinearParameter

Expected behavior
I expect agent.policy._epsilon to be the same type when I reload an agent from disk

System information (please complete the following information):

Python version Python3.8
Mushroom version github version as of 10/11/2021

Categorical Policy for Discrete Action Spaces?

I want to explore policy gradient and actor critic agents on GridWorld environments. To that end, I want to parameterize the policy as a Categorical distribution at each state. How do I do this?

Looking through the available policies, policy.td_policy.Boltzmann appears to perform softmax(logits), which is what I have in mind, but its logits appear to be dictated by Q values:

q_beta = self._approximator.predict(state, **self._predict_params) * self._beta(state)
q_beta -= q_beta.max()
qs = np.exp(q_beta)

I don't want the policy gradient agents to learn a Q function, and the fact that Boltzmann is under td_policy is making me hesitate because policy gradient methods are not a form of TD learning.
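
To make the request concrete, here is a sketch of the kind of policy described: a categorical distribution per state, parameterized directly by logits with no Q function involved. TabularSoftmaxPolicy is a hypothetical class written with NumPy only. It does not implement the library's policy interface.

```python
import numpy as np


class TabularSoftmaxPolicy:
    """Hypothetical categorical policy: one vector of logits per discrete state."""

    def __init__(self, n_states, n_actions, seed=0):
        self.logits = np.zeros((n_states, n_actions))
        self._rng = np.random.default_rng(seed)

    def probabilities(self, state):
        z = self.logits[state] - self.logits[state].max()  # numerical stability
        e = np.exp(z)
        return e / e.sum()

    def draw_action(self, state):
        p = self.probabilities(state)
        return int(self._rng.choice(len(p), p=p))

    def grad_log_prob(self, state, action):
        """Gradient of log pi(action | state) with respect to logits[state]."""
        grad = -self.probabilities(state)
        grad[action] += 1.0
        return grad


policy = TabularSoftmaxPolicy(n_states=4, n_actions=3)
probs = policy.probabilities(0)
grad = policy.grad_log_prob(0, 1)
```

The grad_log_prob method returns the score function that a policy gradient estimator would multiply by the return.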

Some function approximators that do not come from sklearn cannot be used

Hi,

I am working on a library that uses MushroomRL at its core.

I want to be able to use any kind of function approximator, even those that do not come from sklearn but this is not always possible.

The problem is in the module: regressor.py (https://github.com/MushroomRL/mushroom-rl/blob/dev/mushroom_rl/approximators/regressor.py). Specifically in the block of lines 49-51 in which you add to the params dictionary the input_shape and the output_shape, like so:
params['input_shape'] = input_shape
params['output_shape'] = output_shape

You do this only if the approximator module does not start with 'sklearn'.

Now suppose that I use XGBoostRegressor as approximator: everything runs smoothly, I just get a warning from xgboost saying Parameters: { input_shape, output_shape } might not be used

This is all fine until I start using other approximators, such as CatBoostRegressor: here the initialisation of the object CatBoostRegressor fails: __init__() got an unexpected keyword argument 'input_shape'

Obviously I could simply change the current if condition, from: if not approximator.__module__.startswith('sklearn') to include also other modules, but this is just a bad work-around.

Are you already planning a fix so that it will be possible to use other approximators such as CatBoostRegressor?

Thanks.
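
One alternative to extending the module-name check, sketched here under the assumption that the goal is simply not to pass keywords a constructor cannot accept: inspect the constructor's signature and drop unknown entries. filter_kwargs and StrictRegressor are hypothetical names for illustration and are not part of regressor.py.

```python
import inspect


def filter_kwargs(cls, params):
    """Keep only the entries of `params` that `cls.__init__` can accept."""
    signature = inspect.signature(cls.__init__)
    accepts_any = any(p.kind is inspect.Parameter.VAR_KEYWORD
                      for p in signature.parameters.values())
    if accepts_any:
        # A **kwargs constructor accepts everything, so nothing is dropped.
        return dict(params)
    return {k: v for k, v in params.items() if k in signature.parameters}


class StrictRegressor:
    """Stand-in for a model whose constructor rejects unknown keywords."""

    def __init__(self, depth=3):
        self.depth = depth


params = {"depth": 5, "input_shape": (2,), "output_shape": (1,)}
model = StrictRegressor(**filter_kwargs(StrictRegressor, params))
```

A constructor that takes **kwargs would still receive input_shape and output_shape under this scheme, so a warning like the one reported for xgboost would not go away.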

Potential simple regressor for car on the hill FQI example

Currently the car on the hill FQI example works with the ExtraTreesRegressor. Is it possible for it to work with much simpler regressors, say GaussianProcessRegressor?

I tried using the sklearn.gaussian_process.GaussianProcessRegressor with kernel 1.0 * RBF(1.0), but the approximator gets stuck in the first optimization update! I'll be grateful for any possible tricks to make it work.

Suggestion: rename episodes_length to compute_episodes_length

If I can make a suggestion, perhaps you could rename the function mushroom_rl.utils.dataset.episodes_length() to mushroom_rl.utils.dataset.compute_episodes_length(). This would better match your function naming convention (e.g. compute_J(), compute_metrics()) and would also allow me to name the returned variable episode_lengths.

Thanks in advance!

Metrics of algorithm performance

Is your feature request related to a problem? Please describe.
The performance of RL algorithms is very sensitive to the implementation (including tricks not mentioned in their papers). A good RL package should have benchmarks of how well its implementations perform as quantitative checks.

Describe the solution you'd like
Would it be possible to maintain a set of benchmarks for each of the proposed algorithms on standard tasks so that users can be sure that the implementations are faithful to the source code released by the original authors?

Even if the implementations here can't fully reproduce the results in papers, it would be good to benchmark the level of performance one can expect from the implementations in this package.

Describe alternatives you've considered
None

Python check mushroom version

Describe the solution you'd like
It would be good if the mushroom version could be accessed from Python after import:

import mushroom
print(mushroom.__version__)
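
Until such an attribute exists, the installed version can usually be read from the package metadata. This is a sketch that assumes the distribution is named "mushroom-rl", the name used with pip elsewhere on this page. installed_version is a hypothetical helper.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("mushroom-rl")
missing = installed_version("surely-not-an-installed-package-xyz")
```

This relies on importlib.metadata, which is part of the standard library from Python 3.8 onwards.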

Question: Does LSPI work for cartpole constructed in gym rather than using our own environment?

Since the Cartpole environment in gym has a 4-tuple state, I used the following to define the features in cartpole_lspi.py:

s1 = np.array([-2, -1, 0, 1, 2])
s2 = np.array([-1, 0, 1])
s3 = np.array([-np.pi, 0, np.pi]) * .125
s4 = np.array([-1, 0, 1])
s = np.array(np.meshgrid(s1,s2,s3,s4)).T.reshape(-1,4)
for i in s:
    basis.append(GaussianRBF(i, np.array([1.])))

To use the Cartpole environment from gym, I changed
mdp = CartPole() to
```
horizon = 200
gamma = 0.95
mdp = Gym('CartPole-v0', horizon, gamma)
```

With this, LSPI wasn't able to learn. Has anyone tried this before? If not, any suggestions on what else to try to make LSPI learn Cartpole from gym?

Tutorial / Demonstration of Custom Training Loop

Stable Baselines 3 opens by showing how to train an agent in one of two ways: first with a custom loop, second with their .train() method.

https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html

Approach 1:

import gym

from stable_baselines3 import A2C

env = gym.make('CartPole-v1')

model = A2C('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
for i in range(1000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
      obs = env.reset()

Approach 2:

from stable_baselines3 import A2C

model = A2C('MlpPolicy', 'CartPole-v1').learn(10000)

I can find mushroom-rl's equivalent of Approach 2, but I can't find the equivalent of Approach 1. Could someone please provide a tutorial or demonstration?

Thank you in advance!

REINFORCE with optional baseline

REINFORCE, as implemented, always uses a baseline. I would like a flag that makes the baseline optional, specifically to explore the effect that a baseline has on REINFORCE.

Question: How can I manage the reproducibility of an experiment?

Hi,
I'm currently setting a seed value for the environment using its seed method, but when I run the experiment multiple times I get very different results.
I know many variables are involved that can make the results diverge, but I was wondering whether there is any other parameter I can set to reduce this divergence.

I'm using your advanced experiment as a base to run a Gym CartPole environment with LinearDecay epsilon and learning rate parameters, just to learn how your library works, because I have already coded this problem from scratch with Q-learning, with successful results.

My code is something like this:

environment = Gym(name='CartPole-v0', horizon=np.inf, gamma=1.)
environment.seed(seed_val)

# Policy
linear_epsilon = LinearParameter(0.9, 0.1, n=episodes//2)
pi = EpsGreedy(epsilon=linear_epsilon)

# state codification
n_tilings = 1
tilings = Tiles.generate(n_tilings, [1, 1, 6, 3],
                         environment.info.observation_space.low,
                         environment.info.observation_space.high)
features = Features(tilings=tilings)

approximator_params = dict(input_shape=(features.size,),
                           output_shape=(environment.info.action_space.n,),
                           n_actions=environment.info.action_space.n)

# agent
linear_alpha = LinearParameter(0.5, 0.1, n=episodes//2)
agent = SARSALambdaContinuous(environment.info, pi, LinearApproximator,
                              approximator_params=approximator_params,
                              learning_rate=linear_alpha,
                              lambda_coeff=.9, features=features)

It is currently not learning, but that is not the issue; the issue is only that the results differ between runs.

Thanks,
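
Seeding the environment alone usually leaves other random number generators unseeded, for example the one that drives epsilon-greedy exploration. A common approach, sketched below, is to seed every global generator at the start of the script. `seed_everything` is a hypothetical helper, and whether this covers every source of randomness in a given mushroom-rl setup is an assumption to verify.

```python
import random

import numpy as np


def seed_everything(seed):
    """Seed the global RNGs that commonly drive exploration and sampling."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch  # optional; only relevant when neural approximators are used
        torch.manual_seed(seed)
    except ImportError:
        pass


seed_everything(42)
first = np.random.rand(3)
seed_everything(42)
second = np.random.rand(3)
print(np.array_equal(first, second))
```

Calling the helper once before building the environment, policy and agent, in addition to `environment.seed(seed_val)`, is the usual order.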

Adapting A2C and deep policy gradient methods to Discrete envs

Describe the bug
Hi, I am getting the following error. I'm trying to use the A2C algorithm, which samples actions from a Gaussian distribution when given a state. The code seems to expect torch.float32. I have checked, and my inputs are indeed torch.float32, not Long, so I'm not sure what to do. I know the algorithm runs, as I have run it before, but I get this error when I try to use it in an environment with discrete actions.

Here is the library’s draw_action function:

def draw_action_t(self, state):
    print('draw_action_t', state.dtype)
    return self.distribution_t(state).sample().detach()

and

def distribution_t(self, state):
    mu, sigma = self.get_mean_and_covariance(state)
    return torch.distributions.MultivariateNormal(loc=mu, covariance_matrix=sigma)

Final error:

RuntimeError: _th_normal_ not supported on CPUType for Long

Help is much appreciated!

System information (please complete the following information):

OS: macOS 10.12.6
Python version: Python3.6
Torch version: Pytorch 1.2
Mushroom version: master

Additional context
I'm working with the Gym 'CartPole-v0' environment.
I need these deep policy gradient methods to work for discrete environments as I am testing something for my research. Any advice on this?
Help is greatly appreciated!
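
A Gaussian policy produces real-valued actions, so for a discrete action space the usual substitute is a categorical distribution over action preferences. The sketch below shows that sampling step in NumPy to stay framework-agnostic; in PyTorch, `torch.distributions.Categorical` plays the same role. `draw_discrete_action` and the example logits are hypothetical, and this is not how the library's policies are implemented.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)


def draw_discrete_action(logits, rng):
    """Sample an action index from unnormalised action preferences."""
    probs = softmax(np.asarray(logits, dtype=float))
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng(0)
logits = np.array([0.1, 2.0])  # hypothetical network output for CartPole's two actions
print(draw_discrete_action(logits, rng))
```

The sampled index is an integer, which matches what a discrete Gym environment such as CartPole expects as an action.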

Unable to set the environment seed

Describe the bug
I get a NotImplementedError when I try to set the seed on some of the environments.

To Reproduce

env = PuddleWorld()
env.seed(seed)

Expected behavior
I expected seed to be set

System information (please complete the following information):

OS: macOS Big Sur 11.6
Python version: Python3.8.6
Mushroom version: 1.7.0

Additional context
Error message I get

File  "/python3.8/site-packages/mushroom_rl/core/environment.py", line 137, in seed
     raise NotImplementedError
 NotImplementedError

PPO for lunar lander [BUG]

I'm trying to use PPO for the lunar lander, but I can't find examples and my code doesn't seem to converge. Can you spot the issue? Is some parameter wrong?
alg = PPO

from mushroom_rl.policy import BoltzmannTorchPolicy
policy_params = dict(
    std_0=1.,
    n_features=32,
    use_cuda=torch.cuda.is_available()
)
algorithm_params = dict(
    batch_size=128,
    actor_optimizer=optimizer,
    n_epochs_policy=4,
    eps_ppo=.2, lam=.95,
    critic_params=dict(network=net,
                       optimizer=optimizer,
                       loss=F.mse_loss,
                       n_features=32,
                       batch_size=128,
                       input_shape=mdp.info.observation_space.shape,
                       output_shape=(1,))
    )
beta = Parameter(1e0)
policy = BoltzmannTorchPolicy(net, mdp.info.observation_space.shape,
                              mdp.info.action_space.shape,
                              beta, **policy_params)

agent = alg(mdp.info, policy, **algorithm_params)
[...]
core.learn(n_steps=10000,
           n_steps_per_fit=3000, quiet=args.quiet)

Incorrect Shape of Baseline in REINFORCE

In mushroom_rl.algorithms.policy_search.policy_gradient.reinforce, the gradient is computed as:

    def _compute_gradient(self, J):
        baseline = np.mean(self.baseline_num, axis=0) / np.mean(self.baseline_den, axis=0)
        ...

However, the baseline numerator and denominator are calculated as:

        self.list_sum_d_log_pi.append(self.sum_d_log_pi)
        squared_sum_d_log_pi = np.square(self.sum_d_log_pi)
        self.baseline_num.append(squared_sum_d_log_pi * self.J_episode)
        self.baseline_den.append(squared_sum_d_log_pi)

We can see here that the baseline_num and baseline_den both have the same shape as d_log_pi, meaning baseline has the same shape. Surely the shape of the baseline should be a scalar?
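
For context, a component-wise baseline is a known variance-reduction choice for REINFORCE, which could explain the vector shape. Assuming that is what the code intends, and writing \theta_i for the i-th policy parameter and J for the episode return, the quantities above would correspond to:

```latex
b_i = \frac{\mathbb{E}\left[\left(\sum_t \partial_{\theta_i} \log \pi_\theta(a_t \mid s_t)\right)^2 J\right]}
           {\mathbb{E}\left[\left(\sum_t \partial_{\theta_i} \log \pi_\theta(a_t \mid s_t)\right)^2\right]},
\qquad
\hat{g}_i = \mathbb{E}\left[\left(\sum_t \partial_{\theta_i} \log \pi_\theta(a_t \mid s_t)\right)\left(J - b_i\right)\right]
```

Under this reading, each gradient component gets its own scalar baseline b_i, so the baseline as a whole has one entry per parameter.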

[Categorical DQN/Rainbow] Inconsistent behavior of Categorical DQN for an even number of atoms

For an even number of atoms, the calculation of self._a_values (see here) does not seem to be 100% correct. This behavior is reproducible via

import torch
v_min = -5
v_max = 5
n_atoms = 20
delta = (v_max -  v_min) / (n_atoms - 1) # delta = 0.5263157894736842
torch.arange(v_min, v_max + delta, delta)

which yields

tensor([-5.0000, -4.4737, -3.9474, -3.4211, -2.8947, -2.3684, -1.8421, -1.3158,
        -0.7895, -0.2632,  0.2632,  0.7895,  1.3158,  1.8421,  2.3684,  2.8947,
         3.4211,  3.9474,  4.4737,  5.0000,  5.5263])

which has 21 elements, one more than n_atoms. The expected result would be this tensor:

tensor([-5.0000, -4.4737, -3.9474, -3.4211, -2.8947, -2.3684, -1.8421, -1.3158,
        -0.7895, -0.2632,  0.2632,  0.7895,  1.3158,  1.8421,  2.3684,  2.8947,
         3.4211,  3.9474,  4.4737,  5.0000])

According to the torch.arange documentation, an easy solution would be to add a small eps value instead of delta, e.g.

self._a_values = torch.arange(self._v_min, self._v_max + 10e-9, delta)

or to cut off the last value when the tensor is too long, or to use some internal eps value instead of a hard-coded one.
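
Another option, sketched here in NumPy, is to fix the number of points instead of the step size, which pins both endpoints and avoids the floating-point sensitivity of a step-based range. `torch.linspace(v_min, v_max, n_atoms)` takes the same three arguments; whether it fits the rest of the Categorical DQN code is an assumption.

```python
import numpy as np

v_min, v_max, n_atoms = -5.0, 5.0, 20

# Fixing the number of points, rather than the step, pins both endpoints.
atoms = np.linspace(v_min, v_max, n_atoms)
delta = (v_max - v_min) / (n_atoms - 1)

print(len(atoms), atoms[0], atoms[-1])
```

The spacing between consecutive atoms still equals `delta`, so code that relies on the step size would not need to change.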

Is there a way to log the loss during training?

When I train my agent using DQN with the TorchApproximator, is there a way to log the loss and create a loss curve?
This would be a very nice addition to the learning curve, since it gives further insight into the training behavior.
