mushroomrl / mushroom-rl
Python library for Reinforcement Learning.
License: MIT License
In [3]: import mushroom.algorithms.value
/home/tyrion/.local/lib/virtualenvwrapper/fungo/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
return f(*args, **kwds)
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-3-dae6f65bc28c> in <module>()
----> 1 import mushroom.algorithms.value
~/src/mushroom/mushroom/algorithms/value/__init__.py in <module>()
----> 1 from .batch_td import FQI, DoubleFQI, WeightedFQI, LSPI
2 from .dqn import DQN, DoubleDQN, AveragedDQN
3 from .td import QLearning, DoubleQLearning, WeightedQLearning, SpeedyQLearning,\
4 RLearning, RQLearning, SARSA, SARSALambdaDiscrete, SARSALambdaContinuous,\
5 ExpectedSARSA, TrueOnlineSARSALambda
ImportError: cannot import name 'WeightedFQI'
Describe the bug
I run a simple DQN on the Breakout Atari game and memory usage slowly increases: after 20-30 epochs it takes 64GB of memory, and it keeps growing after that. I use a replay memory of 1 million transitions, but I thought that after 4 epochs of 250k iterations the replay memory would already be full and the used RAM shouldn't increase any further. Am I right?
I'm training on CPU, but I guess this shouldn't influence the memory leak.
System information (please complete the following information):
Hi, I cannot import the package. The error is as follows:
File "mushroom_rl\environments\mujoco_envs\humanoid_gait\_external_simulation\muscle_simulation_stepupdate.pyx", line 1, in init mushroom_rl.environments.mujoco_envs.humanoid_gait._external_simulation.muscle_simulation_stepupdate ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
I'm trying to reproduce the DQN results on Breakout with mushroom, but I get much lower average rewards.
I expect to reach at least 300, as in the Nature paper on DQN, but I only reach 175 after 100 epochs.
I started from the example code, where the parameters are already close to those of the Nature paper. I only increased the replay memory to 1M and tested different optimizers, but with no luck.
Do you have any idea what I'm missing?
Describe the bug
While configuring dm_control with mushroom_rl, I am having difficulty configuring mujoco200. The error is the following:
AttributeError: /.mujoco/mujoco200/bin/libmujoco200.so: undefined symbol: mjr_label
Following are the environment variables set:
export MUJOCO_GL=osmesa
export MJLIB_PATH="/.mujoco/mujoco200/bin/libmujoco200.so"
export MJKEY_PATH="/.mujoco/mujoco200/mjkey.txt"
export MUJOCO_PY_MJPRO_PATH="/.mujoco/mujoco200/"
export MUJOCO_PY_MJKEY_PATH="/.mujoco/mujoco200/mjkey.txt"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/.mujoco/mujoco200/bin"
I have been stuck at this point for a long time. Kindly help.
To Reproduce
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from mushroom_rl.algorithms.value import DQN
Expected behavior
No error
System information (please complete the following information):
Additional Context
mujoco_py works perfectly fine as it can render environments from python code.
Describe the bug
The order of constructor parameters in the agents is not consistent.
Examples:
Agent(self, policy, mdp_info, features=None)
DeepAC(self, policy, mdp_info, actor_optimizer, parameters)
A2C(self, mdp_info, policy, critic_params, **alg_params)
BlackBoxOptimization(self, distribution, policy, mdp_info, features=None)
Expected behavior
I would expect parameters to have the same order and position as in the superclass, wherever possible.
Examples:
Agent(self, policy, mdp_info, features=None)
DeepAC(self, policy, mdp_info, actor_optimizer, parameters)
A2C(self, policy, mdp_info, critic_params, **alg_params)
BlackBoxOptimization(self, policy, mdp_info, distribution, features=None)
System information (please complete the following information):
Two common performance metrics in the RL literature are the mean return and the median return; the median is reported because it is less influenced by outliers. However, compute_metrics() does not compute the median. It would be straightforward to change the current line
return np.min(J), np.max(J), np.mean(J), len(J)
to
return np.min(J), np.max(J), np.mean(J), len(J), np.median(J)
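A minimal sketch of the proposed change, assuming J is the sequence of per-episode returns that compute_metrics() already builds. The helper name summarize_returns is hypothetical and not part of the library. Appending the median as the last element keeps the positions of the existing return values unchanged.

```python
import numpy as np


def summarize_returns(J):
    """Return (min, max, mean, count, median) for a sequence of episode returns.

    Hypothetical stand-in for the last line of compute_metrics(); J is
    assumed to be a non-empty sequence of floats.
    """
    J = np.asarray(J, dtype=float)
    return np.min(J), np.max(J), np.mean(J), len(J), np.median(J)


# A single outlier moves the mean much more than the median.
stats = summarize_returns([1.0, 2.0, 3.0, 4.0, 100.0])
```

Existing callers that unpack four values would need to be updated, which is one reason to place the new value last.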
The comments in gaussian_rbf.py say
Factory method to build uniformly spaced Gaussian radial basis functions
with a 25\% overlap.
What is meant by the "25% overlap" here? For example,
what percentage would these 2 Gaussians overlap with?
A 100% overlap can be thought of as two Gaussians with the same parameters. Is the overlap percentage then 100 times the ratio of the common area of the two Gaussians to the area of one of them (the width parameter being the same for both)? If this isn't true, please let me know the formal definition of the "overlap".
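The area-ratio definition proposed in this question can be evaluated in closed form. The sketch below assumes two Gaussians with the same amplitude and the same standard deviation, and it illustrates only the questioner's candidate definition; it is not a statement of what the docstring in gaussian_rbf.py actually means. The function name gaussian_overlap is hypothetical.

```python
import math


def gaussian_overlap(distance, sigma):
    """Shared area of two equal-width Gaussians divided by the area of one.

    Assumes both curves have the same amplitude and standard deviation
    `sigma`, with centers `distance` apart. The curves cross midway between
    the centers, so the shared area is twice one tail:
    2 * Phi(-distance / (2 * sigma)), written here with erfc.
    """
    return math.erfc(abs(distance) / (2.0 * math.sqrt(2.0) * sigma))


for d in (0.0, 1.0, 2.0, 4.0):
    print(f"centers {d} sigma apart: overlap = {100 * gaussian_overlap(d, 1.0):.1f}%")
```

Under this definition the overlap depends only on the ratio of center spacing to width, so a fixed overlap percentage would pin the width once the spacing is chosen.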
Hi,
I have a question regarding the performance of PPO in MushroomRL compared to StableBaselines3.
Is it safe to say that your implementation of PPO should achieve a similar performance (in terms of mean discounted reward) compared to the PPO implementation of StableBaselines3?
By looking at the code it seems sensible to compare the two performances, by setting some hyper-parameters of PPO from StableBaselines3 in order to be able to compare it with your implementation.
However when running some experiments I came across something different:
Running experiments on MushroomRL InvertedPendulum environment, your implementation achieves superior performance (in terms of mean discounted reward).
Running experiments on an LQG environment, StableBaselines3 reaches the theoretical optimal policy whereas your implementation is extremely far off.
In both cases I used the same number of steps, the same hyper-parameters, the same network architecture, the same policy, the same environment and the same evaluation method.
Are you surprised by this? If not, why?
Thank you in advance
I'm working on a problem with small time-series data. I wrote an environment myself in mushroom.
I wanted to see if DQN where the Q-network is a GRU would help. But I keep running into problems, mainly this:
RuntimeError: Expected object of scalar type Double but got scalar type Float for argument #3 'mat2' in call to _th_addmm_out
I tried agents.approximator.model.network = agents.approximator.model.network.double()
and setting the data I use to double() in every step, but it still won't work. I'm wondering if somewhere along the line in mushroom things get cast to floats?
Please let me know if anyone has any insight!
I have also attached my network. I know it looks gross, but I have been trying to track down the problem.
class GRUNetwork(nn.Module):
    n_features = 64
    def __init__(self, input_shape, output_shape, **kwargs):
        super().__init__()
        n_input = input_shape[0]
        n_output = output_shape[0]
        self._gru = nn.GRU(input_size=n_input, hidden_size=self.n_features, batch_first=True)
        self._h1 = nn.Linear(self.n_features, 32)
        self._h2 = nn.Linear(32, 16)
        self._h3 = nn.Linear(16, n_output)
        nn.init.xavier_uniform_(self._h1.weight,
                                gain=nn.init.calculate_gain('relu'))
        nn.init.xavier_uniform_(self._h2.weight,
                                gain=nn.init.calculate_gain('relu'))
        nn.init.xavier_uniform_(self._h3.weight,
                                gain=nn.init.calculate_gain('linear'))
    def forward(self, state, action=None):
        state = state.type(torch.DoubleTensor)
        # print(state.dtype, state.shape, state.dim())
        if state.dim() == 2:
            state = state.unsqueeze(1)
        if state.shape[0] == 256:
            print(state[0].dtype)
            print(state[0][0].dtype)
            print(state[0])
        self._gru = self._gru.double()
        h, _ = self._gru(state.type(torch.DoubleTensor))
        h = F.relu(h.double())
        h = F.relu(self._h1(h.double()))
        h = F.relu(self._h2(h.double()))
        q = self._h3(h.double())
        if action is None:
            return q
        else:
            q_acted = torch.squeeze(q.gather(1, action.long()))
            return q_acted
Can mushroom-rl support multi-agent settings, or will this be added in the future?
I see an example script for DQN, but I'd like to benchmark Atari with other models including my own custom ones. Is there an example of how I can use mushroom-rl to do this?
Using the code in mushroom-rl/docs/source/tutorials/code/dqn.py I trained a network to play Atari Pong. After training, I ran 1000 episodes of one game each to check the winrate of the network versus the Atari computer player. For about a quarter of the episodes the reward reported by core.evaluate was zero. I wonder what this means and if it is an intended result.
Pong is a two-player game. Each player scores points, and a game ends when one player reaches a score of 21. The game cannot end in a tie. A natural definition of the reward for a game would be score(network) - score(Atari player), which cannot be zero. What is the intended reward if it is not this natural definition?
If someone tries to train QLearning on more than a single step, an assertion fails because the dataset is longer than 1.
def fit(self, dataset):
    assert len(dataset) == 1
    state, action, reward, next_state, absorbing = self._parse(dataset)
    self._update(state, action, reward, next_state, absorbing)
The only way I've found to work around this is to train QLearning every step. But this shouldn't need to be the case; QLearning can also be trained at the end of an episode, no problem.
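To illustrate the point, here is a minimal tabular sketch that applies the one-step Q-learning update to every transition of a dataset in order, so it can be called once per episode. The function q_learning_fit, the tuple layout, and the table Q are assumptions made for this example; this is not the library's fit method.

```python
import numpy as np


def q_learning_fit(Q, dataset, alpha=0.1, gamma=0.9):
    """Apply the one-step Q-learning update to every transition in order.

    `dataset` is assumed to be a list of
    (state, action, reward, next_state, absorbing) tuples with integer
    states and actions; `Q` is an (n_states, n_actions) array that is
    updated in place.
    """
    for state, action, reward, next_state, absorbing in dataset:
        q_next = 0.0 if absorbing else np.max(Q[next_state])
        Q[state, action] += alpha * (reward + gamma * q_next - Q[state, action])
    return Q


# A two-step episode on a three-state chain, fitted after the episode ends.
Q = np.zeros((3, 2))
episode = [(0, 1, 0.0, 1, False), (1, 1, 1.0, 2, True)]
q_learning_fit(Q, episode)
```

Note that the behaviour policy stays fixed during the episode in this variant, so the data is slightly more off-policy than with a per-step fit.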
Describe the bug
Hoping to get the features {I(a=L),I(a=L)s,I(a=L)s^2,I(a=R),I(a=R)s,I(a=R)s^2}, I used:
basis = PolynomialBasis.generate(max_degree=2, input_size=2)
features = Features(basis_list=basis)
However, I got the following error:
  File "examples/simplechain_lspi.py", line 77, in <module>
    steps = experiment()
  File "examples/simplechain_lspi.py", line 57, in experiment
    core.evaluate(n_episodes=2, render=True)
  File "/Users/kishanpb/Documents/Github/mushroom-rl/mushroom_rl/core/core.py", line 94, in evaluate
    initial_states)
  File "/Users/kishanpb/Documents/Github/mushroom-rl/mushroom_rl/core/core.py", line 123, in _run
    episodes_progress_bar, render, initial_states)
  File "/Users/kishanpb/Documents/Github/mushroom-rl/mushroom_rl/core/core.py", line 139, in _run_impl
    sample = self._step(render)
  File "/Users/kishanpb/Documents/Github/mushroom-rl/mushroom_rl/core/core.py", line 185, in _step
    action = self.agent.draw_action(self._state)
  File "/Users/kishanpb/Documents/Github/mushroom-rl/mushroom_rl/algorithms/agent.py", line 48, in draw_action
    state = self.phi(state)
  File "/Users/kishanpb/Documents/Github/mushroom-rl/mushroom_rl/features/_implementations/basis_features.py", line 21, in __call__
    out[i] = bf(s)
  File "/Users/kishanpb/Documents/Github/mushroom-rl/mushroom_rl/features/basis/polynomial.py", line 41, in __call__
    out *= x[i]**d
IndexError: index 1 is out of bounds for axis 0 with size 1
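The last frame of the traceback indexes x[i] with i = 1 on an array of size 1, which suggests (as an assumption) that the basis generated with input_size=2 is evaluated on the one-dimensional chain state alone, without the action. The sketch below builds the intended feature vector by hand with NumPy. The function action_polynomial_features is hypothetical and does not use the library's PolynomialBasis or Features classes.

```python
import numpy as np


def action_polynomial_features(s, a, n_actions=2, max_degree=2):
    """Return [I(a=0), I(a=0)s, I(a=0)s^2, I(a=1), I(a=1)s, I(a=1)s^2].

    `s` is a scalar state and `a` an integer action index; only the block
    that belongs to the chosen action is non-zero.
    """
    powers = np.array([float(s) ** d for d in range(max_degree + 1)])
    phi = np.zeros(n_actions * (max_degree + 1))
    start = a * (max_degree + 1)
    phi[start:start + max_degree + 1] = powers
    return phi


phi_left = action_polynomial_features(s=3, a=0)
phi_right = action_polynomial_features(s=3, a=1)
```

With one weight vector over these features, each action effectively gets its own degree-2 polynomial in the state.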
To Reproduce
I used the cartpole_lspi.py code but with the environment:
```
mdp = generate_simple_chain(state_n=5, goal_states=[2], prob=.8, rew=1,
                            gamma=.9)
```
[simplechain_lspi.py.zip](https://github.com/MushroomRL/mushroom-rl/files/4340862/simplechain_lspi.py.zip)
**System information (please complete the following information):**
- OS: Mac 10.13.6
- Python version: Python3.7
- Torch version: Pytorch 1.3
- Mushroom version: master
Is your feature request related to a problem? Please describe.
I would like to reduce the memory taken by RL with atari, so I can run many experiments at the same time.
Describe the solution you'd like
Compress the frames; for example, RLlib uses LZ4 compression. This is different from LazyFrames.
If I want to implement this myself, where should I make the change?
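As a rough illustration of the idea, the sketch below stores a frame as compressed bytes and decompresses it on access. It uses zlib from the standard library as a stand-in so that the example has no extra dependency; an LZ4 binding could be swapped in for speed. The class CompressedFrame is hypothetical, and where such a wrapper would plug into the library's replay memory is not shown here.

```python
import zlib

import numpy as np


class CompressedFrame:
    """Hold one frame as compressed bytes and rebuild the array on demand."""

    def __init__(self, frame):
        self._shape = frame.shape
        self._dtype = frame.dtype
        # Level 1 favours speed over ratio, which matters in a replay buffer.
        self._data = zlib.compress(np.ascontiguousarray(frame).tobytes(), 1)

    def decompress(self):
        flat = np.frombuffer(zlib.decompress(self._data), dtype=self._dtype)
        return flat.reshape(self._shape)

    @property
    def nbytes(self):
        return len(self._data)


frame = np.zeros((84, 84), dtype=np.uint8)
frame[40:44, 10:70] = 255          # a mostly empty frame compresses well
packed = CompressedFrame(frame)
```

The trade-off is extra CPU time on every sampled batch, so the gain depends on how compressible the frames are.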
Describe the bug
I modified DQN to enable n-step DQN, but I get worse results. Am I missing something?
To Reproduce
Use DQN with this function, defining self.n_steps in __init__:
def _fit_standard(self, dataset):
    self._replay_memory.add(dataset, n_steps_return=self.n_steps, gamma=self.mdp_info.gamma)
    if self._replay_memory.initialized:
        state, action, reward, next_state, absorbing, _ = \
            self._replay_memory.get(self._batch_size())
        if self._clip_reward:
            reward = np.clip(reward, -1, 1)
        q_next = self._next_q(next_state, absorbing)
        gamma = self.mdp_info.gamma ** self.n_steps * (1 - absorbing)
        q = reward + gamma * q_next
        self.approximator.fit(state, action, q, **self._fit_params)
Expected behavior
DQN with 2 or 3 steps is worse than 1-step DQN on Atari Breakout and Lunar Lander. I'm not sure if it's a bug or if it's supposed to be worse. In any case, it would be nice to have n-step DQN implemented in mushroom_rl.
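For reference, the target that an n-step variant aims at can be written out in plain Python. This is a sketch under stated assumptions: the function n_step_target and its arguments are hypothetical, rewards are assumed unclipped, and the bootstrap value is assumed to come from the target network. It is not a diagnosis of the code above.

```python
def n_step_target(rewards, q_bootstrap, gamma, absorbing):
    """Discounted sum of the collected rewards plus a bootstrapped tail.

    `rewards` holds the k <= n rewards actually collected, `q_bootstrap` is
    max_a Q(s_{t+k}, a), and `absorbing` says whether s_{t+k} is terminal,
    in which case nothing is bootstrapped.
    """
    target = 0.0
    for i, r in enumerate(rewards):
        target += (gamma ** i) * r
    if not absorbing:
        # The exponent is the number of rewards actually summed.
        target += (gamma ** len(rewards)) * q_bootstrap
    return target


three_step = n_step_target([1.0, 0.0, 1.0], q_bootstrap=2.0, gamma=0.5, absorbing=False)
terminal = n_step_target([1.0, 1.0], q_bootstrap=2.0, gamma=0.5, absorbing=True)
```

Comparing a few targets computed this way against the values produced by the modified agent is one way to check that the replay memory and the discount exponent agree.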
System information (please complete the following information):
thanks
Hi,
The Segway environment is not loaded in mushroom_rl/environments/__init__.py, which is confusing when you try to import it. Since it is the only one missing, I assume you simply forgot to put it there?
Best,
Robert
Is your feature request related to a problem? Please describe.
I was wondering why the predict method in class TorchApproximator calculates the gradients and calls self.network.forward(*torch_args, **kwargs).
Describe the solution you'd like
Why not use the with torch.no_grad() statement, to save memory on the one hand and to omit the detach() call on the other? Further, calling self.network(*torch_args, **kwargs) instead of forward is better practice (if there's no good reason for doing otherwise).
Describe alternatives you've considered
If it is the desired behaviour that the output_tensor flag means a tensor with gradients should be returned, the first point is void.
Additional context
Additionally, in line 254 of the same file, the .requires_grad_(False) has no effect, since the .detach() call has already taken care of it.
I was trying to make LSPI work for Acrobot, but it isn't working. I tried basis functions like the following:
# basis 1
basis = [PolynomialBasis()]
s1 = np.array([-np.pi, 0, np.pi]) * .25
s2 = np.array([-2*np.pi, 0, 2*np.pi])
s3 = np.array([-4*np.pi, 0, 4*np.pi])
s = np.array(np.meshgrid(s1,s1,s1,s1,s2,s3)).T.reshape(-1,6)
for i in s:
    basis.append(GaussianRBF(i, np.array([2.])))
# basis 2
basis=GaussianRBF.generate(n_centers=[3,3,3,3,3,3], low=[-1,-1,-1,-1,-4*np.pi,-9*np.pi],\
high=[1,1,1,1,4*np.pi,9*np.pi])#, dimensions=[1,1,1,1])
basis.append(PolynomialBasis())
# basis 3
basis = PolynomialBasis.generate(max_degree=4, input_size=6)
Does anyone have a working basis function for Acrobot or any other environment?
I trained a Q-learning agent in one environment and want to use the same trained agent in another, slightly different environment. How can I do that?
Can't install the package
./python.exe -m pip install mushroom-rl
System information:
Logs
Building wheel for mushroom_rl (pyproject.toml) ... error
error: subprocess-exited-with-error
× Building wheel for mushroom_rl (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [344 lines of output]
...
LINK : fatal error LNK1104: cannot open file 'build\temp.win-amd64-cpython-310\Release\mushroom_rl/environments/mujoco_envs/humanoid_gait/_external_simulation\muscle_simulation_stepupdate.cp310-win_amd64.exp'
error: command 'C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.33.31629\\bin\\HostX86\\x64\\link.exe' failed with exit code 1104
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for mushroom_rl
Failed to build mushroom_rl
ERROR: Could not build wheels for mushroom_rl, which is required to install pyproject.toml-based projects
In my experiments, the conjugate gradient method in TRPO does not seem to work quite right.
The residual shrinks very slowly, if at all, and is not even close to the tolerance (1e-2 instead of 1e-10) after the default number of iterations (10). Thus, the CG method never stops early and does not return an exact solution either.
Furthermore, I checked the progress on the objective function of the related minimization problem f(x) = 1/2 * x.T * A * x - b*x
Interestingly, the values seem to increase rather than decrease monotonically.
I added the following code at line 158 of trpo.py to monitor the progress.
Fx = self._fisher_vector_product(x, obs, old_pol_dist).detach().cpu().numpy()
print(0.5*x.dot(Fx) - b.detach().cpu().numpy().dot(x))
After some debugging, I think the problem lies in the Fisher-vector product. The general CG method worked in examples where I replaced the FVP with an ordinary matrix-vector multiplication.
I am not sure whether this problem has any effect on the overall performance of TRPO. Nevertheless, the result of the conjugate gradient defines the search direction of the line search and is therefore a central part of the optimization.
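For comparison, here is a minimal sketch of plain conjugate gradient on an explicitly symmetric positive definite matrix, recording f(x) after each iteration. The matrix A, the vector b, and the function name conjugate_gradient are illustrative assumptions, not code from trpo.py. With such a matrix, f(x) should never increase from one iteration to the next, so an increasing sequence points at the operator passed in rather than at the CG loop itself.

```python
import numpy as np


def conjugate_gradient(matvec, b, n_iter=10, tol=1e-10):
    """Solve A x = b for symmetric positive definite A, given v -> A v.

    Returns the solution and the value of f(x) = 0.5 x^T A x - b^T x
    after every iteration.
    """
    x = np.zeros_like(b)
    r = b.copy()            # residual b - A x for the starting point x = 0
    p = r.copy()            # first search direction
    r_dot = r.dot(r)
    history = []
    for _ in range(n_iter):
        Ap = matvec(p)
        alpha = r_dot / p.dot(Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        history.append(0.5 * x.dot(matvec(x)) - b.dot(x))
        new_r_dot = r.dot(r)
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x, history


rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)   # symmetric positive definite by construction
b = rng.standard_normal(5)
x, history = conjugate_gradient(lambda v: A @ v, b)
```

Swapping the lambda for the Fisher-vector product while keeping the same monitoring is a quick way to compare the two behaviours.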
First of all, thank you for setting up such an amazing and easy-to-use library. While working with mushroom_rl on finite MDPs, I came across the solving step inside the policy_iteration function located in ./solvers/dynamic_programming.py.
Line 76 states
value = np.linalg.inv(i - gamma * p_pi).dot(r_pi)
Here the computation can be easily substituted by just solving the corresponding linear system via
value = np.linalg.solve(i - gamma * p_pi, r_pi)
This version should be considerably faster, since np.linalg.inv() internally solves a linear system with the identity matrix as right-hand side, which is costly for large dense matrices.
Regards,
Florian
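A small self-contained check of the suggestion, using a hypothetical three-state chain under a fixed policy (the arrays p_pi and r_pi below are made up for illustration). Both expressions yield the same value function, while the solve variant avoids forming the inverse explicitly.

```python
import numpy as np

gamma = 0.9
# Hypothetical transition matrix under a fixed policy: each row sums to 1.
p_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
r_pi = np.array([0.0, 0.0, 1.0])
i = np.eye(3)

# Current form: build the inverse, then multiply.
value_inv = np.linalg.inv(i - gamma * p_pi).dot(r_pi)
# Proposed form: solve the linear system (I - gamma * P) v = r directly.
value_solve = np.linalg.solve(i - gamma * p_pi, r_pi)
```

The last state loops onto itself with reward 1, so its value is 1 / (1 - gamma) = 10, which both variants reproduce.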
Problem:
While trying to run the example montain_car_sarsa.py, I got the error TypeError: 'NoneType' object is not callable. By inspecting the ./environment/__init__.py file, it was clear that openai/gym was missing.
Solution:
pip install gym
solved the issue; adding the line gym to requirements.txt should fix this by default.
Is there an option to discretize the state space of an environment included in this repo?
Dyna-Q is a conceptual algorithm that illustrates how real and simulated experience can be combined in building a policy. Planning in RL terminology refers to using simulated experience generated by a model to find or improve a policy for interacting with a modeled environment
Are there any plans to add this agent to mushroom-rl?
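The combination of real and simulated experience described above can be sketched in a few lines of tabular Python. This is an illustrative sketch only: the function dyna_q_update, the dictionary-based Q table, and the deterministic model are assumptions made for the example, terminal states are not treated specially, and nothing here uses or mirrors a mushroom-rl API.

```python
import random


def dyna_q_update(Q, model, transition, n_actions, alpha=0.1, gamma=0.95,
                  n_planning=5, rng=random):
    """One Dyna-Q step: learn from a real transition, then replay the model."""
    s, a, r, s_next = transition

    def td_update(s, a, r, s_next):
        best_next = max(Q.get((s_next, b), 0.0) for b in range(n_actions))
        old = Q.get((s, a), 0.0)
        Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

    td_update(s, a, r, s_next)           # direct RL from real experience
    model[(s, a)] = (r, s_next)          # deterministic model learning
    for _ in range(n_planning):          # planning from simulated experience
        ps, pa = rng.choice(list(model))
        pr, ps_next = model[(ps, pa)]
        td_update(ps, pa, pr, ps_next)


Q, model = {}, {}
dyna_q_update(Q, model, (0, 1, 1.0, 1), n_actions=2)
```

After this single call the model contains one entry, so all five planning updates replay the same transition and the stored value has been updated six times in total.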
Hi.
I want to create a completely new vectorized environment that is not provided by any library.
I tried to create it using the MushroomRL library.
However, although I read and followed the MushroomRL documentation, I couldn't create my custom environment.
URL I referenced:
https://mushroomrl.readthedocs.io/en/latest/source/tutorials/tutorials.5_environments.html#creating-a-new-environment
I thought I could create a custom environment with the MushroomRL library.
Is there anything I'm missing or misunderstood?
I'm trying to implement a simple REINFORCE agent on Gridworld. However, I keep hitting the following error:
File "/home/rylan/Documents/GanguliGang-Metacognitive-Actor-Critic/mac_venv/lib/python3.6/site-packages/mushroom_rl/core/core.py", line 141, in _run_impl
sample = self._step(render)
File "/home/rylan/Documents/GanguliGang-Metacognitive-Actor-Critic/mac_venv/lib/python3.6/site-packages/mushroom_rl/core/core.py", line 188, in _step
action = self.agent.draw_action(self._state)
File "/home/rylan/Documents/GanguliGang-Metacognitive-Actor-Critic/mac_venv/lib/python3.6/site-packages/mushroom_rl/core/agent.py", line 65, in draw_action
return self.policy.draw_action(state)
File "/home/rylan/Documents/GanguliGang-Metacognitive-Actor-Critic/mac_venv/lib/python3.6/site-packages/mushroom_rl/policy/td_policy.py", line 149, in draw_action
return np.array([np.random.choice(self._approximator.n_actions,
AttributeError: 'NoneType' object has no attribute 'n_actions'
It appears that the policy needs to be initialized with an approximator. I would really appreciate a simple tutorial showing how to create an approximator and a policy on a simple environment.
Thanks in advance!
Is your feature request related to a problem? Please describe.
Stable Baselines 3 supports hyperparameter tuning with Optuna. It would be really helpful if this were supported in Mushroom-RL. It's a bit difficult now, since all the training logic is hidden in core.learn().
Hi, this library is great! Is it compatible with continuous control tasks from pixels?
I would like to run MuJoCo environments with MushroomRL. However, on my computer, running the PPO algorithm for 1M steps in HalfCheetah takes more than 1 hour. I think environment parallelization could greatly reduce the required time, but I haven't figured out how to use vector environments. Any suggestions?
Can you implement new space types like those in Gymnasium, or give me some guidance on implementing them?
There are some important spaces, such as Dict and Tuple, which I need.
Thanks in advance
Describe the bug
If I save an agent (even with full save) where I have a LinearParameter as epsilon for EpsGreedy, and I load the agent, the epsilon is a Parameter and not a LinearParameter, so I cannot continue with the same EpsGreedy policy.
My goal is to save an agent so I can resume the training.
To Reproduce
epsilon = LinearParameter(value=args.initial_exploration_rate,
threshold_value=args.final_exploration_rate,
n=args.final_exploration_frame)
pi = EpsGreedy(epsilon=epsilon)
[...]
agent = alg(mdp.info, pi, approximator,
approximator_params=approximator_params,
**algorithm_params)
agent.save('agent.msh', full_save=True)
[...]
agent = DQN.load('agent.msh')
#agent.policy._epsilon is a Parameter and not a LinearParameter
Expected behavior
I expect agent.policy._epsilon to be the same type when I reload an agent from disk
System information (please complete the following information):
I want to explore policy gradient and actor-critic agents on GridWorld environments. To that end, I want to parameterize the policy as a Categorical distribution at each state. How do I do this?
Looking through the available policies, policy.td_policy.Boltzmann appears to perform softmax(logits), which is what I have in mind, but its logits appear to be dictated by Q values:
q_beta = self._approximator.predict(state, **self._predict_params) * self._beta(state)
q_beta -= q_beta.max()
qs = np.exp(q_beta)
I don't want the policy gradient agents to learn a Q function, and the fact that Boltzmann is under td_policy is making me hesitate, because policy gradient methods are not a form of TD learning.
Hi,
I am working on a library that uses MushroomRL at its core.
I want to be able to use any kind of function approximator, even those that do not come from sklearn but this is not always possible.
The problem is in the module: regressor.py (https://github.com/MushroomRL/mushroom-rl/blob/dev/mushroom_rl/approximators/regressor.py). Specifically in the block of lines 49-51 in which you add to the params dictionary the input_shape and the output_shape, like so:
params['input_shape'] = input_shape
params['output_shape'] = output_shape
You do this only if the approximator module does not start with 'sklearn'.
Now suppose that I use XGBoostRegressor as approximator: everything runs smoothly, I just get a warning from xgboost saying Parameters: { input_shape, output_shape } might not be used
This is all fine until I start using other approximators, such as CatBoostRegressor: here the initialisation of the object CatBoostRegressor fails: __init__() got an unexpected keyword argument 'input_shape'
Obviously I could simply change the current if condition, from: if not approximator.__module__.startswith('sklearn')
to include also other modules, but this is just a bad work-around.
Are you already planning a fix so that it will be possible to use other approximators such as CatBoostRegressor?
Thanks.
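One possible direction, sketched here purely as an assumption and not as the library's planned fix, is to inspect the constructor signature and pass input_shape and output_shape only when the approximator can accept them. The helper filter_shape_params and the two demo classes are hypothetical. Constructors implemented in compiled code may not expose an inspectable signature, so a real implementation would need a fallback.

```python
import inspect


def filter_shape_params(approximator, params, input_shape, output_shape):
    """Add the shape arguments only when the constructor can accept them."""
    params = dict(params)
    signature = inspect.signature(approximator.__init__)
    accepts_kwargs = any(p.kind is inspect.Parameter.VAR_KEYWORD
                         for p in signature.parameters.values())
    for name, value in (('input_shape', input_shape),
                        ('output_shape', output_shape)):
        if accepts_kwargs or name in signature.parameters:
            params[name] = value
    return params


class ShapeAware:
    """Demo approximator whose constructor expects the shapes."""
    def __init__(self, input_shape, output_shape, n_features=8):
        self.input_shape, self.output_shape = input_shape, output_shape


class ShapeUnaware:
    """Demo approximator that would reject unknown keyword arguments."""
    def __init__(self, depth=3):
        self.depth = depth


aware_params = filter_shape_params(ShapeAware, {}, (4,), (2,))
unaware_params = filter_shape_params(ShapeUnaware, {'depth': 5}, (4,), (2,))
model = ShapeUnaware(**unaware_params)
```

A constructor that takes **kwargs still receives the shapes under this rule, which matches the warning-only behaviour reported for the first regressor above.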
Currently the car on the hill FQI example works with the ExtraTreesRegressor. Is it possible for it to work with much simpler regressors, say GaussianProcessRegressor?
I tried using the sklearn.gaussian_process.GaussianProcessRegressor with kernel 1.0 * RBF(1.0), but the approximator gets stuck in the first optimization update! I'll be grateful for any possible tricks to make it work.
If I can make a suggestion, perhaps you could rename the function mushroom_rl.utils.dataset.episodes_length() to mushroom_rl.utils.dataset.compute_episodes_length(). This would better match your function naming convention (e.g. compute_J(), compute_metrics()) and would also allow me to name the returned variable episode_lengths.
Thanks in advance!
Is your feature request related to a problem? Please describe.
The performance of RL algorithms is very sensitive to the implementation (including tricks not mentioned in their papers). A good RL package should have benchmarks of how well its implementations perform as quantitative checks.
Describe the solution you'd like
Would it be possible to maintain a set of benchmarks for each of the proposed algorithms on standard tasks, so that users can be sure that the implementations are faithful to the source code released by the original authors?
Even if the implementations here can't fully reproduce the results in papers, it would be good to benchmark the level of performance one can expect from the implementations in this package.
Describe alternatives you've considered
None
Describe the solution you'd like
To check the mushroom version, it would be good if it could be accessed in Python after import:
import mushroom
print(mushroom.__version__)
Since the Cartpole environment in gym has a 4-tuple state, I used the following to define the features in cartpole_lspi.py:
s1 = np.array([-2, -1, 0, 1, 2])
s2 = np.array([-1, 0, 1])
s3 = np.array([-np.pi, 0, np.pi]) * .125
s4 = np.array([-1, 0, 1])
s = np.array(np.meshgrid(s1,s2,s3,s4)).T.reshape(-1,4)
for i in s:
    basis.append(GaussianRBF(i, np.array([1.])))
To use the Cartpole environment from gym, I changed
mdp = CartPole()
to
```
horizon = 200
gamma = 0.95
mdp = Gym('CartPole-v0', horizon, gamma)
```
With this, LSPI wasn't able to learn. Has anyone tried this before? If not, any suggestions on what else to try to make LSPI learn Cartpole from gym?
Stable Baselines 3 opens by showing how to train an agent in one of two ways: first with a custom loop, second with their .learn() method alone:
https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html
Approach 1:
import gym
from stable_baselines3 import A2C
env = gym.make('CartPole-v1')
model = A2C('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)
obs = env.reset()
for i in range(1000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()
Approach 2:
from stable_baselines3 import A2C
model = A2C('MlpPolicy', 'CartPole-v1').learn(10000)
I can find mushroom-rl's equivalent of Approach 2, but I can't find the equivalent of Approach 1. Could someone please provide a tutorial or demonstration?
Thank you in advance!
REINFORCE, as implemented, always uses a baseline. I would like a flag that makes the baseline optional, specifically to explore the effect that a baseline has on REINFORCE.
Hi,
I'm currently setting a seed value for the environment using its seed method, but when I run it multiple times I get very different results.
I know that many variables can make the results diverge, but I was wondering whether there is another parameter I should set to reduce this divergence.
I'm using your advanced experiment as a base to run a Gym CartPole environment with LinearDecay epsilon and learning rate parameters, just to learn how your library works, since I have already coded this problem from scratch with Q-learning with successful results.
My code is something like this
environment = Gym(name='CartPole-v0', horizon=np.inf, gamma=1.)
environment.seed(seed_val)
# Policy
linear_epsilon = LinearParameter(0.9, 0.1, n=episodes//2)
pi = EpsGreedy(epsilon=linear_epsilon)
# state codification
n_tilings = 1
tilings = Tiles.generate(n_tilings, [1, 1, 6, 3],
environment.info.observation_space.low,
environment.info.observation_space.high)
features = Features(tilings=tilings)
approximator_params = dict(input_shape=(features.size,),
output_shape=(environment.info.action_space.n,),
n_actions=environment.info.action_space.n)
# agent
linear_alpha = LinearParameter(0.5, 0.1, n=episodes//2)
agent = SARSALambdaContinuous(environment.info, pi, LinearApproximator,
approximator_params=approximator_params,
learning_rate=linear_alpha,
lambda_coeff=.9, features=features)
It is currently not learning, but that is not the issue; the issue is only that the results differ between runs.
Thanks,
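A traceback quoted elsewhere on this page shows a policy drawing actions with np.random.choice, so seeding only the environment leaves at least the global NumPy generator unseeded. The sketch below seeds the global Python and NumPy generators and checks that a stand-in rollout repeats. The helpers seed_everything and noisy_rollout are hypothetical; when a Torch approximator is involved, torch.manual_seed(seed) would be added as well.

```python
import random

import numpy as np


def seed_everything(seed):
    """Seed the global Python and NumPy random generators."""
    random.seed(seed)
    np.random.seed(seed)


def noisy_rollout(n_steps=5):
    # Stand-in for an epsilon-greedy rollout that draws from np.random.
    return [int(np.random.choice(2)) for _ in range(n_steps)]


seed_everything(42)
first = noisy_rollout()
seed_everything(42)
second = noisy_rollout()
```

Calling the seeding helper once at the start of each run, next to environment.seed(seed_val), should make runs with the same seed comparable, assuming no other source of randomness is involved.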
Describe the bug
Hi, I am getting the following error. I'm trying to use the A2C algorithm, which samples actions from a Gaussian distribution when given a state. The code seems to expect torch.float32. I have checked, and my inputs are indeed torch.float32, not Long. I'm not sure what to do. I know the algorithm runs, as I have run it before, but I get this error when I try to use it in an environment with discrete actions.
Here is the library’s draw_action function:
def draw_action_t(self, state):
    print('draw_action_t',state.dtype)
    return self.distribution_t(state).sample().detach()
and
def distribution_t(self, state):
    mu, sigma = self.get_mean_and_covariance(state)
    return torch.distributions.MultivariateNormal(loc=mu, covariance_matrix=sigma)
Final error:
RuntimeError: _th_normal_ not supported on CPUType for Long
Help is much appreciated!
System information (please complete the following information):
Additional context
I'm working with the Gym 'CartPole-v0' environment.
I need these deep policy gradient methods to work for discrete environments as I am testing something for my research. Any advice on this?
Help is greatly appreciated!
Describe the bug
I get NotImplementedError when I try to set the seed on some of the environments.
To Reproduce
env = PuddleWorld()
env.seed(seed)
Expected behavior
I expected seed to be set
System information (please complete the following information):
Additional context
Error message I get
File "/python3.8/site-packages/mushroom_rl/core/environment.py", line 137, in seed
raise NotImplementedError
NotImplementedError
I'm trying to use PPO for Lunar Lander, but I can't find examples and my code doesn't seem to converge. Can you spot the issue? Is some parameter wrong?
alg = PPO
from mushroom_rl.policy import BoltzmannTorchPolicy
policy_params = dict(
std_0=1.,
n_features=32,
use_cuda=torch.cuda.is_available()
)
algorithm_params = dict(
batch_size=128,
actor_optimizer=optimizer,
n_epochs_policy=4,
eps_ppo=.2, lam=.95,
critic_params=dict(network=net,
optimizer=optimizer,
loss=F.mse_loss,
n_features=32,
batch_size=128,
input_shape=mdp.info.observation_space.shape,
output_shape=(1,))
)
beta = Parameter(1e0)
policy = BoltzmannTorchPolicy(net, mdp.info.observation_space.shape,
mdp.info.action_space.shape,
beta, **policy_params)
agent = alg(mdp.info, policy, **algorithm_params)
[...]
core.learn(n_steps=10000,
n_steps_per_fit=3000, quiet=args.quiet)
In mushroom_rl.algorithms.policy_search.policy_gradient.reinforce, the gradient is computed as:
def _compute_gradient(self, J):
    baseline = np.mean(self.baseline_num, axis=0) / np.mean(self.baseline_den, axis=0)
    ...
However, the baseline numerator and denominator are calculated as:
self.list_sum_d_log_pi.append(self.sum_d_log_pi)
squared_sum_d_log_pi = np.square(self.sum_d_log_pi)
self.baseline_num.append(squared_sum_d_log_pi * self.J_episode)
self.baseline_den.append(squared_sum_d_log_pi)
We can see here that baseline_num and baseline_den both have the same shape as d_log_pi, meaning baseline has the same shape. Surely the baseline should be a scalar?
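The shapes involved can be checked with a small NumPy sketch that mirrors the quoted lines. All arrays below are random, hypothetical data. The result is one baseline value per policy parameter, which corresponds to a component-wise baseline in which each gradient component gets its own variance-reducing constant; whether that is the intent of the implementation is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_episodes, n_params = 4, 3

# Hypothetical per-episode quantities: the sum of d log pi over the episode
# (one entry per policy parameter) and the episode return J.
sum_d_log_pi = rng.standard_normal((n_episodes, n_params))
J = rng.standard_normal(n_episodes)

squared = np.square(sum_d_log_pi)
baseline_num = squared * J[:, None]        # shape (n_episodes, n_params)
baseline_den = squared                     # shape (n_episodes, n_params)
baseline = baseline_num.mean(axis=0) / baseline_den.mean(axis=0)

# Each gradient component uses its own baseline entry.
gradient = np.mean(sum_d_log_pi * (J[:, None] - baseline), axis=0)
```

A scalar baseline would instead subtract the same number from J for every component, which is also valid but generally gives a different variance.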
For an even number of atoms, the calculation of self._a_values (see here) does not seem to be 100% correct. This behavior is reproducible via
import torch
v_min = -5
v_max = 5
n_atoms = 20
delta = (v_max - v_min) / (n_atoms - 1) # delta = 0.5263157894736842
torch.arange(v_min, v_max + delta, delta)
which yields
tensor([-5.0000, -4.4737, -3.9474, -3.4211, -2.8947, -2.3684, -1.8421, -1.3158,
-0.7895, -0.2632, 0.2632, 0.7895, 1.3158, 1.8421, 2.3684, 2.8947,
3.4211, 3.9474, 4.4737, 5.0000, 5.5263])
which has one element too many (21 instead of 20). The expected result would be this tensor:
tensor([-5.0000, -4.4737, -3.9474, -3.4211, -2.8947, -2.3684, -1.8421, -1.3158,
-0.7895, -0.2632, 0.2632, 0.7895, 1.3158, 1.8421, 2.3684, 2.8947,
3.4211, 3.9474, 4.4737, 5.0000])
According to the torch.arange documentation, an easy solution would be to add a small eps value instead of delta, e.g.
self._a_values = torch.arange(self._v_min, self._v_max + 10e-9, delta)
or to cut off the last value when the tensor is too big, or to use some internal eps value instead of a hard-coded one.
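Another option, sketched here in plain Python as an assumption rather than as the library's fix, is to build the atoms from integer indices so that the length is fixed by construction and does not depend on how v_max + delta rounds. For tensors, torch.linspace(v_min, v_max, n_atoms) expresses the same idea.

```python
v_min, v_max, n_atoms = -5.0, 5.0, 20
delta = (v_max - v_min) / (n_atoms - 1)

# Index-based construction: exactly n_atoms values, independent of
# floating-point rounding at the upper end of the range.
a_values = [v_min + i * delta for i in range(n_atoms)]
```

The last value can differ from v_max by a rounding error, so code that needs the exact endpoint may prefer a linspace-style call.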
When I train my agent using DQN with the TorchApproximator, is there a way to log the loss and create a loss curve?
This would be a very nice addition to the learning curve, since it gives further insight into the training behavior.