polixir / neorl
Python interface for accessing the near real-world offline reinforcement learning (NeoRL) benchmark datasets
Home Page: http://polixir.ai/research/neorl
License: Apache License 2.0
Dear Authors,
In your MOPO implementation, when generating transitions from the ensemble, you take into account the min/max of the training batch, as follows:
obs_max = torch.as_tensor(train_buffer['obs'].max(axis=0)).to(self.device)
obs_min = torch.as_tensor(train_buffer['obs'].min(axis=0)).to(self.device)
rew_max = train_buffer['rew'].max()
rew_min = train_buffer['rew'].min()
This way, prior to computing the penalised reward and adding the experience tuple to the batch, you can clamp the observation and the reward between the min and max recovered above:
next_obs = torch.max(torch.min(next_obs, obs_max), obs_min)
reward = torch.clamp(reward, rew_min, rew_max)
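For clarity, the nested max/min above is just an element-wise clamp with per-dimension tensor bounds. A minimal numpy stand-in (illustrative values only, not taken from any real buffer) shows the same operation:

```python
import numpy as np

# Hypothetical per-dimension bounds taken from a training buffer (made-up values).
obs_min = np.array([-1.0, 0.0])
obs_max = np.array([1.0, 2.0])

# A model-generated next observation that strays outside the data support.
next_obs = np.array([-3.0, 1.5])

# Element-wise clamp: the numpy analogue of torch.max(torch.min(next_obs, obs_max), obs_min).
clamped = np.minimum(np.maximum(next_obs, obs_min), obs_max)
print(clamped)  # [-1.   1.5]
```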
Is there a particular reason behind this choice? I could not find a counterpart in the original MOPO implementation/publication. Or is it simply due to other re-implementation needs, given the different framework used?
Kind regards
Dear Authors,
I can't get my head around a particular line of code in pretrain_dynamics.py, line 56.
There, the number of hidden units for each hidden layer in the ensemble depends on the task:
hidden_units = 1024 if config["task"] in ['ib', 'finance', 'citylearn'] else 256
Thus, if I understand correctly, each hidden layer in the 7 models would have 256 units in the MuJoCo tasks and 1024 otherwise.
However, the paper states:
... For model-based approaches, the transition model is represented by a local Gaussian distribution ... ... by an MLP with 4 hidden layers and 256 units per layer ...
On the contrary, following the paper, the algo_init function for MOPO (as a model-based algorithm example) sets the number of hidden_units in the ensemble to the provided config value, which defaults to 256. Nonetheless, this ensemble is ignored if a pre-trained one is given.
transition = EnsembleTransition(obs_shape, action_shape, args['hidden_layer_size'], args['transition_layers'], args['transition_init_num']).to(args['device'])
All things considered, is there a particular reason why the pretrain_dynamics script instantiates the hidden layers in the ensemble with 1024 units instead of 256?
Or can I simply ignore this change, given that the results in the paper, as stated, were obtained using the latter value?
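To make the discrepancy concrete, the one-liner quoted above can be read as a per-task hidden-size selection; wrapping it in a helper (the function name is my own, only the conditional expression appears in pretrain_dynamics.py):

```python
# Sketch of the per-task hidden-size selection quoted above; the helper name is
# hypothetical, only the if-expression appears in pretrain_dynamics.py.
def transition_hidden_units(task: str) -> int:
    return 1024 if task in ['ib', 'finance', 'citylearn'] else 256

print(transition_hidden_units('finance'))        # 1024
print(transition_hidden_units('HalfCheetah-v3')) # 256
```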
Kind regards
I'm trying to get the normalized scores for the NeoRL MuJoCo tasks.
However, I could not find the expert scores and random scores in the codebase or in the paper.
Can you provide them, or point me to where I can find those scores?
(I think they should be somewhere, but I cannot find them...)
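For reference, D4RL normalizes so that 0 corresponds to a random policy and 100 to an expert policy. A minimal sketch, assuming NeoRL follows the same convention (the reference scores themselves would still need to come from the authors; the numbers below are made up):

```python
def d4rl_normalized_score(score: float, random_score: float, expert_score: float) -> float:
    # D4RL convention: 0 = random policy, 100 = expert policy.
    return 100.0 * (score - random_score) / (expert_score - random_score)

# Made-up illustrative values, not real NeoRL reference scores.
print(d4rl_normalized_score(2000.0, 0.0, 4000.0))  # 50.0
```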
Something similar to what is done here - https://github.com/rail-berkeley/d4rl/blob/master/scripts/generation/generate_ant_maze_datasets.py
I see that the get_dataset() call returns a dict with most of the relevant information for d4rl; I wanted to know how I may generate more than the 10000 data points provided.
I'd like to use NeoRL, specifically the finance env, and generate a dataset in the d4rl format. That would be highly useful.
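A minimal key-renaming sketch for the conversion: the NeoRL keys ('obs', 'action', 'rew', ...) are guesses inferred from the snippets in this thread, while D4RL expects 'observations', 'actions', 'rewards', 'terminals', and 'next_observations':

```python
import numpy as np

# Hypothetical conversion sketch; the NeoRL key names are assumptions based on
# code quoted elsewhere in this thread, not a confirmed API.
def to_d4rl_format(neorl_data: dict) -> dict:
    key_map = {
        'obs': 'observations',
        'action': 'actions',
        'rew': 'rewards',
        'done': 'terminals',
        'next_obs': 'next_observations',
    }
    return {d4rl_key: np.asarray(neorl_data[k])
            for k, d4rl_key in key_map.items() if k in neorl_data}

# Tiny fabricated sample, just to show the renaming.
sample = {'obs': [[0.0]], 'action': [[1.0]], 'rew': [0.5]}
out = to_d4rl_format(sample)
print(sorted(out))  # ['actions', 'observations', 'rewards']
```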
Thanks,
Krishna
Hi,
In the sales promotion environment, the reward is computed as rew = (d_total_gmv - d_total_cost)/self.num_users, which means the operator observes a single reward signal aggregated over all users. However, in the offline training dataset the reward differs for each user across the 50 days. For example, see the user orders and reward graph below.
As per my understanding, the reward should be the same each day for the three users and gradually increase over the 50 days as sales increase. Could you kindly let me know how the reward in the training dataset was calculated?
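To spell out the arithmetic behind the question, with toy numbers (not taken from the real environment): the quoted formula yields one scalar per step, shared across users.

```python
# Toy, made-up numbers illustrating the quoted reward formula: a single scalar
# reward per step, identical for every user.
num_users = 3
d_total_gmv = 300.0   # hypothetical change in total GMV for the day
d_total_cost = 90.0   # hypothetical change in total cost for the day

rew = (d_total_gmv - d_total_cost) / num_users
print(rew)  # 70.0
```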
Hi, our team is training our model with NeoRL and found a difference in the action space between the dataset and the environment.
When executing the code below:
env = neorl.make('Citylearn')
low_dataset, _ = env.get_dataset(
data_type="low",
train_num=args.traj_num,
need_val=False,
use_data_reward=True,
)
action_low = env.action_space.low
action_high = env.action_space.high
print('action_low', action_low)
print('action_high', action_high)
print('dataset action_low', np.min(low_dataset['action'], axis=0))
print('dataset action_high', np.max(low_dataset['action'], axis=0))
the output is below, and the action range is obviously different between the dataset and the environment, which confuses us.
action_low [-0.33333334 -0.33333334 -0.33333334 -0.33333334 -0.33333334 -0.33333334
-0.33333334 -0.33333334 -0.33333334 -0.33333334 -0.33333334 -0.33333334
-0.33333334 -0.33333334]
action_high [0.33333334 0.33333334 0.33333334 0.33333334 0.33333334 0.33333334
0.33333334 0.33333334 0.33333334 0.33333334 0.33333334 0.33333334
0.33333334 0.33333334]
dataset action_low [-3.5973904 -4.031006 -3.167992 -3.1832075 -3.4287922 -3.9067357
-3.4079363 -3.3709202 -3.1863866 -4.1262846 -3.6601577 -4.087899
-3.8954997 -3.312598 ]
dataset action_high [3.4334774 3.8551078 3.4849963 3.7777936 3.6103873 3.9329555 3.7596557
3.7149396 4.0387006 3.3615265 3.946596 4.272308 3.4278386 3.3716872]
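As a diagnostic only (whether clipping is the right fix is exactly the open question here), one can measure how much of the dataset lies outside the declared box. The samples below are toy values shaped like the data, and the bound mirrors the printout above:

```python
import numpy as np

# Diagnostic sketch with fabricated toy samples; the bound mirrors the
# env.action_space printout above.
actions = np.array([[-3.6, 0.2],
                    [ 0.1, 3.9]])
low, high = -0.33333334, 0.33333334

clipped = np.clip(actions, low, high)               # project into the env's box
frac_outside = np.mean((actions < low) | (actions > high))
print(frac_outside)  # 0.5
```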
Hi,
This data could be extremely helpful for our research (it would considerably decrease the compute time needed).
Hi, Gev here, the author of Aim.
I love your work on NeoRL and would love to contribute.
Aim 3.x was released a while ago; it is a much improved and more scalable version (especially for RL use cases).
Would you be open to me updating Aim to 3.x and adding an instruction section to the README so it's easier to run the benchmarks?
The goal is to provide an easier and smoother experience with Aim for NeoRL users.
How can a recurrent neural network be implemented in this project?
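As a starting point for the question above, here is a minimal vanilla-RNN forward pass in numpy, purely as a sketch of the idea; since the benchmark code is PyTorch-based, a real recurrent policy would instead replace the agent's MLP with something like torch.nn.GRU over observation sequences (all shapes and names below are made up for illustration):

```python
import numpy as np

# Minimal vanilla-RNN forward pass over a sequence of observations.
# All dimensions and weights are fabricated for illustration only.
rng = np.random.default_rng(0)
obs_dim, hidden_dim, T = 4, 8, 5

W_xh = rng.normal(scale=0.1, size=(obs_dim, hidden_dim))   # input-to-hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                 # initial hidden state
for t in range(T):                       # iterate over the observation sequence
    obs_t = rng.normal(size=obs_dim)
    h = np.tanh(obs_t @ W_xh + h @ W_hh + b_h)

print(h.shape)  # (8,)
```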
Hi,
I am executing the benchmark scripts with the IB datasets and I am not getting any results. The picture corresponds to running BCQ with the IB-Low-100 dataset. Every 10 training epochs I run 100 validation episodes and record their mean reward; the result is a horizontal line, so no learning takes place.
Thank you for all your answers!
Hi!
I was trying to reproduce the results for BCQ but failed. For example, in the maze2d-large-v1 environment, the d4rl score given by this repo is around 20, whereas the d4rl score given by the original BCQ code is around 30.
I tested the algorithm over three seeds and averaged the performance over the last 10 evaluations, so the gap does not seem to result from a bad seed selection or large performance fluctuations. I also tried replacing the hyperparameters in benchmark/OfflineRL/offlinerl/algo/modelfree/bcq.py
and benchmark/OfflineRL/offlinerl/config/algo/bcq_config.py
with the original ones, but it still failed.
Could you please figure that out and fix it? Thanks a lot!