
neorl's People

Contributors

icaruswizard, mzktbyjc2016


neorl's Issues

Next_Observation and Reward clamping in MOPO

Dear Authors,
In your MOPO implementation, when generating transitions from the ensemble, you take into account the min/max of the training data, as follows:

obs_max = torch.as_tensor(train_buffer['obs'].max(axis=0)).to(self.device)
obs_min = torch.as_tensor(train_buffer['obs'].min(axis=0)).to(self.device)
rew_max = train_buffer['rew'].max()
rew_min = train_buffer['rew'].min()

Then, before computing the penalised reward and adding the experience tuple to the batch, you clamp the next observation and the reward to the min and max recovered above:

next_obs = torch.max(torch.min(next_obs, obs_max), obs_min)
reward = torch.clamp(reward, rew_min, rew_max)

Is there a particular reason behind this choice? I could not find a corresponding step in the original MOPO implementation or publication. Or is it simply a re-implementation detail, given the different framework used?
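
For reference, here is a minimal sketch of how I understand these two snippets fit together in the branch-rollout step. The names (transition, lam) and the exact form of the penalty are my assumptions, not necessarily your code:

import torch

def rollout_step(transition, obs, action, obs_min, obs_max, rew_min, rew_max, lam=1.0):
    """Hypothetical single branch-rollout step; names and penalty form are assumptions."""
    with torch.no_grad():
        dist = transition(torch.cat([obs, action], dim=-1))   # predictive Gaussian
        sample = dist.sample()
        next_obs, reward = sample[..., :-1], sample[..., -1:]

        # The clamping step in question: keep predictions inside the ranges
        # observed in the offline training data.
        next_obs = torch.max(torch.min(next_obs, obs_max), obs_min)
        reward = torch.clamp(reward, rew_min, rew_max)

        # MOPO-style penalty: subtract a multiple of the model uncertainty.
        penalty = dist.stddev.norm(dim=-1, keepdim=True)
        reward = reward - lam * penalty
    return next_obs, reward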

Kind regards

Question regarding the dynamics pre-training

Dear Authors,
I can't get my head around a particular line of code in pretrain_dynamics.py, line 56.
There, the number of hidden units for each hidden layer in the ensemble depends on the task:

    hidden_units = 1024 if config["task"] in ['ib', 'finance', 'citylearn'] else 256

Thus, if I understand correctly, each hidden layer in the 7 ensemble models would have 256 units for the MuJoCo tasks and 1024 units otherwise.
However, the paper states:

... For model-based approaches, the transition model is represented by a local Gaussian distribution ... ... by an MLP with 4 hidden layers and 256 units per layer ...

In contrast, following the paper, the algo_init function for MOPO (as a model-based example) sets the number of hidden units in the ensemble to the provided config value, which defaults to 256. However, this ensemble is ignored if a pre-trained one is given.

    transition = EnsembleTransition(obs_shape, action_shape, args['hidden_layer_size'], args['transition_layers'], args['transition_init_num']).to(args['device'])

All things considered, is there a particular reason why the pretrain_dynamics script instantiates the hidden layers of the ensemble with 1024 units instead of 256 for these tasks?
Or can I simply ignore this difference, given that the results in the paper were, as stated, obtained with the latter value?
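
For what it's worth, a quick way to check which width a given pre-trained ensemble was actually saved with is to inspect its state dict; the checkpoint path below is a placeholder:

import torch

# Load a pre-trained ensemble checkpoint and print the parameter shapes,
# so the hidden width (256 vs 1024) can be read off directly.
# The path is a placeholder; substitute the checkpoint you actually use.
state_dict = torch.load("path/to/pretrained_ensemble.pt", map_location="cpu")

# Some checkpoints store the whole module; fall back to .state_dict() if so.
if hasattr(state_dict, "state_dict"):
    state_dict = state_dict.state_dict()

for name, tensor in state_dict.items():
    print(f"{name}: {tuple(tensor.shape)}")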

Kind regards

Expert scores & random scores for normalization

I'm trying to compute normalized scores for the NeoRL MuJoCo tasks, but I could not find the expert and random scores in either the codebase or the paper.

Can you provide them, or point me to where I can find them?
(I assume they are documented somewhere, but I cannot find them...)
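
In case it helps, what I have in mind is the usual D4RL-style normalization; the random_score and expert_score values are exactly what I am missing:

def normalized_score(score: float, random_score: float, expert_score: float) -> float:
    """D4RL-style normalization: 0 corresponds to a random policy, 100 to an expert."""
    return 100.0 * (score - random_score) / (expert_score - random_score)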

Can I use NeoRL to generate a dataset in the d4rl format, e.g. for the finance environment?

Something similar to what is done here - https://github.com/rail-berkeley/d4rl/blob/master/scripts/generation/generate_ant_maze_datasets.py

I see that the get_dataset() call returns a dict with most of the information relevant for d4rl; I wanted to know how I might generate more than the 10,000 data points provided.

I'd like to use NeoRL, specifically the finance env, to generate a dataset in the d4rl format. That would be highly useful.
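
To make the request concrete, something along these lines is what I have in mind. This is a rough sketch: the behaviour policy here is just random actions, and the env name and API calls are my reading of the NeoRL docs, so please correct me if they differ:

import numpy as np
import neorl

# Roll out a behaviour policy (random here) in the NeoRL finance env and pack
# the transitions into a d4rl-style dict. Substitute your own behaviour policy.
env = neorl.make("finance")

data = {"observations": [], "actions": [], "rewards": [],
        "next_observations": [], "terminals": []}

obs = env.reset()
for _ in range(100000):  # however many transitions are needed
    action = env.action_space.sample()
    next_obs, reward, done, info = env.step(action)
    data["observations"].append(obs)
    data["actions"].append(action)
    data["rewards"].append(reward)
    data["next_observations"].append(next_obs)
    data["terminals"].append(done)
    obs = env.reset() if done else next_obs

dataset = {k: np.asarray(v) for k, v in data.items()}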

Thanks,
Krishna

Question regarding the reward of sales promotion training dataset

Hi,

In the sales promotion environment, the reward is computed as rew = (d_total_gmv - d_total_cost) / self.num_users, which means the operator observes a single reward signal aggregated over all users. However, in the offline training dataset the reward differs for each user across the 50 days. For example, see the user orders and reward graphs below:
[Figures: per-user orders and rewards over 50 days]

As per my understanding, the reward should be the same each day for the three users and should gradually increase over the 50 days as sales increase. Could you kindly let me know how the reward in the training dataset was calculated?
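
To spell out my expectation: with the environment's formula, each day yields a single scalar that every user would share. A toy illustration with made-up numbers:

import numpy as np

# Toy illustration of my expectation (numbers are made up).
num_users = 3
d_total_gmv = 900.0   # day's total GMV change across all users
d_total_cost = 300.0  # day's total coupon cost across all users

# Environment formula: one scalar reward per day, shared by all users.
rew = (d_total_gmv - d_total_cost) / num_users
print(rew)  # 200.0 -- the same value for each of the three users that day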

Action space difference between dataset and environment

Hi, our team is training a model with NeoRL and we have found a difference in action space between the dataset and the environment.

When executing the code below:

import numpy as np
import neorl

env = neorl.make('Citylearn')
low_dataset, _ = env.get_dataset(
    data_type="low",
    train_num=args.traj_num,
    need_val=False,
    use_data_reward=True,
)
# Compare the env's declared action bounds with the actual range in the dataset.
action_low = env.action_space.low
action_high = env.action_space.high
print('action_low', action_low)
print('action_high', action_high)
print('dataset action_low', np.min(low_dataset['action'], axis=0))
print('dataset action_high', np.max(low_dataset['action'], axis=0))

The output is below; the action range clearly differs between the dataset and the env, which confuses us.

action_low [-0.33333334 -0.33333334 -0.33333334 -0.33333334 -0.33333334 -0.33333334
 -0.33333334 -0.33333334 -0.33333334 -0.33333334 -0.33333334 -0.33333334
 -0.33333334 -0.33333334]
action_high [0.33333334 0.33333334 0.33333334 0.33333334 0.33333334 0.33333334
 0.33333334 0.33333334 0.33333334 0.33333334 0.33333334 0.33333334
 0.33333334 0.33333334]
dataset action_low [-3.5973904 -4.031006  -3.167992  -3.1832075 -3.4287922 -3.9067357
 -3.4079363 -3.3709202 -3.1863866 -4.1262846 -3.6601577 -4.087899
 -3.8954997 -3.312598 ]
dataset action_high [3.4334774 3.8551078 3.4849963 3.7777936 3.6103873 3.9329555 3.7596557
 3.7149396 4.0387006 3.3615265 3.946596  4.272308  3.4278386 3.3716872]
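
For reference, one way to quantify the mismatch is to check what fraction of dataset action entries falls outside the environment's bounds, reusing the variables from the snippet above (a sketch, not output we have verified):

import numpy as np

# Fraction of dataset action entries outside the env's declared action bounds.
out_of_range = np.logical_or(low_dataset['action'] < action_low,
                             low_dataset['action'] > action_high)
print('fraction of out-of-range action entries:', out_of_range.mean())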


Baseline Policies and Raw Results

Hi,

  • Do you plan to open-source raw results (especially for the newer version of your paper)? This could be very helpful for computing other relevant metrics.
  • Do you plan to open-source baseline policies?

This data could be extremely helpful for our research, as it would considerably reduce the compute time we need.

Update Aim version and add Aim running instructions on README

Hi, Gev here, the author of Aim.
I love your work on NeoRL and would love to contribute.

Changes Proposal

Aim 3.x was released a while ago, and it is a much improved and more scalable version (especially for RL use cases).
I would love to update Aim to 3.x and add an instructions section to the README so it is easier to run the benchmarks; a sketch of what that could look like is below.
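
Roughly, the tracking code in the benchmark scripts could look like this with Aim 3.x; the metric and hyperparameter names here are illustrative placeholders, not the repo's actual ones:

from aim import Run

# Illustrative Aim 3.x tracking snippet; metric and hyperparameter names are placeholders.
run = Run(experiment="neorl-benchmark")
run["hparams"] = {"algo": "bcq", "task": "ib", "data_level": "low", "traj_num": 100}

for epoch in range(10):
    # In the real scripts these values would come from the training/evaluation loops.
    train_loss = 1.0 / (epoch + 1)
    eval_reward = float(epoch)
    run.track(train_loss, name="train_loss", step=epoch, context={"subset": "train"})
    run.track(eval_reward, name="eval_reward", step=epoch, context={"subset": "eval"})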

Motivation

To provide an easy and smooth experience for NeoRL users with Aim.

LSTM

How can a recurrent neural network (e.g., an LSTM) be implemented in this project?

Learning curves in IB

Hi,

I am executing the benchmark scripts with the IB datasets and I am not getting any learning. The picture below corresponds to a run of BCQ with the IB-Low-100 dataset. Every 10 training epochs I run 100 validation episodes and record their mean reward; the result is a horizontal line, i.e., no learning takes place.

Thank you for all your answers!

[Figure: ib_benchmark_training_curve_example]

Unable to reproduce results for BCQ

Hi!
I was trying to reproduce results for BCQ but failed. For example, in the maze2d-large-v1 environment, the d4rl score given by this repo is around 20. In contrast, the d4rl score given by the original code for BCQ is around 30.

I tested the algorithm over three seeds and averaged the performance over the last 10 evaluations, so it does not seem to result from bad seed selection or large performance fluctuations. I also tried replacing the hyperparameters in benchmark/OfflineRL/offlinerl/algo/modelfree/bcq.py and benchmark/OfflineRL/offlinerl/config/algo/bcq_config.py with the original ones, but it still failed.
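
Concretely, the number I report is aggregated as follows (a sketch with placeholder scores, not my actual data):

import numpy as np

# 3 seeds, average of the last 10 evaluations per seed; the arrays here are placeholders.
eval_scores_per_seed = [np.random.uniform(15, 25, size=100) for _ in range(3)]

per_seed = [scores[-10:].mean() for scores in eval_scores_per_seed]
final_score = float(np.mean(per_seed))
print(final_score)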

Could you please figure that out and fix it? Thanks a lot!
