sfujim / BCQ
Author's PyTorch implementation of BCQ for continuous and discrete actions
License: MIT License
With your settings, I can't train a behavioral agent: the DQN model doesn't converge. I want to know whether I made a mistake somewhere or whether there is a problem in the code. Can you help me?
Hi,
I've been reading the original and discretised BCQ papers and wanted to ask if the original BCQ algorithm could be applied to an environment with discrete actions. I'm probably misinterpreting sections on both papers, but to my understanding, the original BCQ algorithm could handle both continuous and discrete actions, whereas the discretised BCQ algorithm focuses solely on discrete actions. I am curious about the VAE in the original paper and wonder if it can be used in a discrete setting.
Also, I wanted to ask if the "i1/i2" layers in the discretised BCQ network represent the generative model? I'm having some difficulty relating the "i" layers of the BCQ network to Algorithm 1 in the discrete BCQ paper.
I appreciate any help with this query!
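For context on the second question: in the discrete variant the VAE is replaced by a small imitation head (the "i" layers), whose softmax output is used to mask out actions that are unlikely under the batch before taking the Q-argmax. A numpy sketch of that masking step (the threshold value 0.3 matches the repo's default; the array values are made up for illustration):

```python
import numpy as np

def bcq_filtered_argmax(q_values, imt_probs, threshold=0.3):
    """Argmax over Q, restricted to actions whose imitation probability
    is within `threshold` of the most likely action in that state."""
    mask = (imt_probs / imt_probs.max(axis=1, keepdims=True)) > threshold
    # Disallowed actions get -inf so the argmax ignores them.
    masked_q = np.where(mask, q_values, -np.inf)
    return masked_q.argmax(axis=1)

q = np.array([[5.0, 1.0, 0.0]])
probs = np.array([[0.05, 0.6, 0.35]])   # action 0 is rare in the batch
print(bcq_filtered_argmax(q, probs))    # -> [1]: action 0 is filtered out
```

This is only a sketch of the filtering idea; in the repo the imitation head and Q-network share a torso and the comparison is done on (log-)softmax outputs.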
Hi, I think BCQ addresses the extrapolation error very well. I'm curious how to plot Figure 1.f from the paper using the collected offline batch. Many thanks for your reply!
First of all, thanks for sharing the source code.
My question is whether the "done" condition was used incorrectly in the discrete action branch?
BCQ/discrete_BCQ/discrete_BCQ.py
Line 137 in 4876f7e
I guess it should be
target_Q = reward + (1 - done) * self.discount * q.gather(1, next_action).reshape(-1, 1)
when the flag is "done" instead of "not_done".
Looking forward to your reply.
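For what it's worth, the two conventions compute the same target as long as the flag's meaning matches the formula; a small numpy sketch with a made-up mini-batch (names follow the snippet above):

```python
import numpy as np

reward = np.array([[1.0], [0.5]])
not_done = np.array([[1.0], [0.0]])  # 1.0 while the episode continues, 0.0 at termination
discount = 0.99
next_q = np.array([[2.0], [3.0]])    # stand-in for q.gather(1, next_action)

# With a `not_done` flag the bootstrap term is masked out at terminal states:
target_q = reward + not_done * discount * next_q

# The `(1 - done)` form is equivalent only when the flag really stores `done`:
done = 1.0 - not_done
assert np.allclose(target_q, reward + (1 - done) * discount * next_q)
```

So the correctness question reduces to which quantity the replay buffer actually stores for that field.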
Hello, I am currently studying offline reinforcement learning and came across BCQ. It's a great work worth delving into. However, I have some questions regarding the paper that I'd like to clarify and ensure that I haven't misunderstood. My questions might be numerous, but I genuinely want to understand the experimental details in the paper.
Here are my questions:
In Figure 1, does "Off-policy DDPG" refer to DDPG trained on a fixed dataset without interaction with the environment? Additionally, as a benchmark for comparison, does "Behavioral" refer to DDPG trained with the normal training process?
In Figure 1, for the three experiments with different buffers, is "Final" understood as training a Behavioral DDPG, recording transitions during the training process, and then using the final buffer as a dataset for Off-Policy DDPG training (with no new transitions added to the buffer during training)? Can "Concurrent" be simply understood as Off-Policy DDPG gradually using transitions from early-stage to late-stage during the training process, rather than having the chance to sample late-stage transitions right from the beginning?
In Figure 1, the orange horizontal lines in (a)(c) represent calculating the episode average return after collecting the complete buffer using Behavioral (after Behavioral training concludes) right? Is this also the reason why there is no such line in (b) (because the buffer is in the process of collecting transitions)?
Based on the experiments in Figure 1, can it be understood that
(1) even if offline RL uses a dataset with sufficient coverage, the extrapolation error (caused by the actor in DDPG taking an action out-of-distribution) leads to suboptimal performance?
(2) even if offline RL uses the same buffer as Behavioral, because the transitions in the buffer are still not generated by offline RL itself, there is still a distribution shift issue.
(3) even if offline RL is trained with expert or near-expert data, without encountering "bad" (early-stage) data it may fail to learn which actions should be avoided, making performance worse than with the final and concurrent buffers.
Is this correct?
"""
i_loss = F.nll_loss(imt, action.reshape(-1))
"""
Thanks!
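As a side note on that line: F.nll_loss expects its input to already hold log-probabilities (i.e. the network ends in log_softmax), which is why imt can be fed to it directly. A numpy re-implementation of the same convention, with made-up logits:

```python
import numpy as np

def nll_loss(log_probs, targets):
    """Mean negative log-likelihood of the target class,
    matching F.nll_loss's default 'mean' reduction."""
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 0.5], [0.1, 1.0]])
# log_softmax: subtract the log-sum-exp of each row
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
targets = np.array([0, 1])
loss = nll_loss(log_probs, targets)
```

Feeding raw probabilities (or raw logits) into nll_loss silently computes the wrong quantity, so the log_softmax upstream is essential.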
Why is the default buffer size 10% of the paper's?
Thank you for the clean codebase!
I'm trying to reproduce the results in the paper, but I'm unable to obtain an expert with good performance while running the code as is.
I suspect it's because of software version mismatch. Can you please post the software packages configuration that you used to run the code?
Thank you!
The packages in my conda environment can be found below. With these package versions, I was only able to obtain a policy with episode return approximately 230 on Hopper-v1, which is quite low.
Hello, first of all, thanks for this great work!
I found an issue with this line:
BCQ/discrete_BCQ/discrete_BCQ.py
Line 53 in 9690927
I am trying to reproduce the results for the continuous environments, but my results are poor. Could you please give more details about the expected results? For example, what result should we get when running "python main.py --train_behavioral --gaussian_std 0.1"?
Traceback (most recent call last):
File "C:\Users\user\Desktop\BCQ-master\BCQ-master\discrete_BCQ\main.py", line 297, in
train_BCQ(env, replay_buffer, is_atari, num_actions, state_dim, device, args, parameters)
File "C:\Users\user\Desktop\BCQ-master\BCQ-master\discrete_BCQ\main.py", line 144, in train_BCQ
replay_buffer.load(f"./buffers/{buffer_name}")
File "C:\Users\user\Desktop\BCQ-master\BCQ-master\discrete_BCQ\utils.py", line 110, in load
reward_buffer = np.load(f"{save_folder}_reward.npy")
File "D:\anconda\envs\pytorch\lib\site-packages\numpy\lib\npyio.py", line 405, in load
fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: './buffers/Default_PongNoFrameskip-v0_0_reward.npy'
Process finished with exit code 1
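In case it helps, that buffer file is only written after the earlier phases have run, so the three stages have to be executed in order. A typical sequence (flag names taken from the repo's discrete_BCQ/main.py; the environment name shown is the default and assumed here):

```shell
# 1. Train the behavioral DQN (saves the policy used to collect data)
python main.py --train_behavioral --env PongNoFrameskip-v0
# 2. Generate and save the replay buffer under ./buffers/
python main.py --generate_buffer --env PongNoFrameskip-v0
# 3. Train BCQ from the saved buffer (the default mode)
python main.py --env PongNoFrameskip-v0
```

Running step 3 before step 2 produces exactly this FileNotFoundError, since the *_reward.npy file does not exist yet.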
Hi,
Thanks for sharing the source code. I am trying to run your code on my own environment: I have my {action, state, next_state, reward, not_done, ptr} npy files all prepared in the ./buffer folder. When I start training BCQ, however, I see that train_BCQ in main.py calls eval_policy inside the "while training_iters < args.max_timesteps:" loop. I thought batch RL does not interact with the environment; if so, how do I implement env.step for my environment?
imt = (imt/imt.max(1, keepdim=True)[0] > self.threshold).float()
should be
imt = (imt/imt.max(1, keepdim=True) > self.threshold).float()
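For reference, PyTorch's Tensor.max(dim, keepdim=True) returns a (values, indices) pair rather than a single tensor, so the [0] in the original line selects the values tensor; without it the division would fail. A quick check with made-up probabilities:

```python
import torch

x = torch.tensor([[0.2, 0.8],
                  [0.9, 0.1]])

# max over a dim returns a (values, indices) namedtuple; [0] takes the values.
values = x.max(1, keepdim=True)[0]   # [[0.8], [0.9]]
ratios = x / values
mask = (ratios > 0.3).float()
print(mask)                          # [[0., 1.], [1., 0.]]
```

So the indexing in the repo's line appears intentional rather than a typo.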
Hi,
I am very interested in your paper, especially for the discrete one. However, I found it hard to train an Atari game on my laptop. Instead of Atari, I want to try some naive states other than pixels. I noticed that you implemented the fully connected layers for non-Atari environments. Could you please give some advice on the environment?