sfujim / BCQ
Author's PyTorch implementation of BCQ for continuous and discrete actions
License: MIT License
With your settings, I can't train a behavioral agent: the DQN model doesn't converge. I want to know whether I made a mistake somewhere or whether there is a problem in the code. Can you help me?
Hi,
I've been reading the original and discretised BCQ papers and wanted to ask if the original BCQ algorithm could be applied to an environment with discrete actions. I'm probably misinterpreting sections on both papers, but to my understanding, the original BCQ algorithm could handle both continuous and discrete actions, whereas the discretised BCQ algorithm focuses solely on discrete actions. I am curious about the VAE in the original paper and wonder if it can be used in a discrete setting.
Also, I wanted to ask if the "i1/i2" layers in the discretised BCQ network represent the generative model? I'm having some difficulty relating the "i" layers of the BCQ network to Algorithm 1 in the discrete BCQ paper.
I appreciate any help with this query!
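For context on the second question: in the discrete variant the VAE is replaced by a small imitation head (the "i" layers), whose softmax output is used to mask out actions that are unlikely under the batch before taking the Q-argmax. A numpy sketch of that masking step (the threshold value 0.3 matches the repo's default; the array values are made up for illustration):

```python
import numpy as np

def bcq_filtered_argmax(q_values, imt_probs, threshold=0.3):
    """Argmax over Q, restricted to actions whose imitation probability
    is within `threshold` of the most likely action in that state."""
    mask = (imt_probs / imt_probs.max(axis=1, keepdims=True)) > threshold
    # Disallowed actions get -inf so the argmax ignores them.
    masked_q = np.where(mask, q_values, -np.inf)
    return masked_q.argmax(axis=1)

q = np.array([[5.0, 1.0, 0.0]])
probs = np.array([[0.05, 0.6, 0.35]])   # action 0 is rare in the batch
print(bcq_filtered_argmax(q, probs))    # -> [1]: action 0 is filtered out
```

This is only a sketch of the filtering idea; in the repo the imitation head and Q-network share a torso and the comparison is done on (log-)softmax outputs.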
Hi, I think BCQ addresses the extrapolation error very well. I'm curious how to plot Figure 1.f from the paper using the collected offline batch. Many thanks for your reply!
First of all, thanks for sharing the source code.
My question is whether the "done" condition was used incorrectly in the discrete action branch?
BCQ/discrete_BCQ/discrete_BCQ.py
Line 137 in 4876f7e
I guess it should be
target_Q = reward + (1 - done) * self.discount * q.gather(1, next_action).reshape(-1, 1)
when the flag is "done" instead of "not_done".
Looking forward to your reply.
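For what it's worth, the two conventions compute the same target as long as the flag's meaning matches the formula; a small numpy sketch with a made-up mini-batch (names follow the snippet above):

```python
import numpy as np

reward = np.array([[1.0], [0.5]])
not_done = np.array([[1.0], [0.0]])  # 1.0 while the episode continues, 0.0 at termination
discount = 0.99
next_q = np.array([[2.0], [3.0]])    # stand-in for q.gather(1, next_action)

# With a `not_done` flag the bootstrap term is masked out at terminal states:
target_q = reward + not_done * discount * next_q

# The `(1 - done)` form is equivalent only when the flag really stores `done`:
done = 1.0 - not_done
assert np.allclose(target_q, reward + (1 - done) * discount * next_q)
```

So the correctness question reduces to which quantity the replay buffer actually stores for that field.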
Hello, I am currently studying offline reinforcement learning and came across BCQ. It's a great work worth delving into. However, I have some questions regarding the paper that I'd like to clarify and ensure that I haven't misunderstood. My questions might be numerous, but I genuinely want to understand the experimental details in the paper.
Here are my questions:
In Figure 1, does "Off-policy DDPG" refer to DDPG trained on a fixed dataset without interaction with the environment? Additionally, as a benchmark for comparison, does "Behavioral" refer to DDPG trained with the normal training process?
In Figure 1, for the three experiments with different buffers, is "Final" understood as training a Behavioral DDPG, recording transitions during the training process, and then using the final buffer as a dataset for Off-Policy DDPG training (with no new transitions added to the buffer during training)? Can "Concurrent" be simply understood as Off-Policy DDPG gradually using transitions from early-stage to late-stage during the training process, rather than having the chance to sample late-stage transitions right from the beginning?
In Figure 1, the orange horizontal lines in (a)(c) represent calculating the episode average return after collecting the complete buffer using Behavioral (after Behavioral training concludes) right? Is this also the reason why there is no such line in (b) (because the buffer is in the process of collecting transitions)?
Based on the experiments in Figure 1, can it be understood that
(1) even if offline RL uses a dataset with sufficient coverage, the extrapolation error (caused by the actor in DDPG taking an action out-of-distribution) leads to suboptimal performance?
(2) even if offline RL uses the same buffer as Behavioral, because the transitions in the buffer are still not generated by offline RL itself, there is still a distribution shift issue.
(3) even if offline RL is trained with expert or near-expert data, without encountering "bad" (early-stage) data it may fail to learn which actions should be avoided, making performance worse than with the final and concurrent buffers.
Is this correct?
"""
i_loss = F.nll_loss(imt, action.reshape(-1))
"""
Thanks!
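As a side note on that line: F.nll_loss expects its input to already hold log-probabilities (i.e. the network ends in log_softmax), which is why imt can be fed to it directly. A numpy re-implementation of the same convention, with made-up logits:

```python
import numpy as np

def nll_loss(log_probs, targets):
    """Mean negative log-likelihood of the target class,
    matching F.nll_loss's default 'mean' reduction."""
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 0.5], [0.1, 1.0]])
# log_softmax: subtract the log-sum-exp of each row
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
targets = np.array([0, 1])
loss = nll_loss(log_probs, targets)
```

Feeding raw probabilities (or raw logits) into nll_loss silently computes the wrong quantity, so the log_softmax upstream is essential.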
Why is the default buffer size 10% of the paper's?
Thank you for the clean codebase!
I'm trying to reproduce the results in the paper, but I'm unable to obtain an expert with good performance while running the code as is.
I suspect it's because of software version mismatch. Can you please post the software packages configuration that you used to run the code?
Thank you!
The packages in my conda environment can be found below. With these package versions, I was only able to obtain a policy with episode return approximately 230 on Hopper-v1, which is quite low.
Hello, first of all, thanks for this great work!
I found an issue with this line:
BCQ/discrete_BCQ/discrete_BCQ.py
Line 53 in 9690927
I am trying to reproduce the results for the continuous environments, but my results are poor. Could you please give more details about the expected results? For example, what result should we get when running "python main.py --train_behavioral --gaussian_std 0.1"?
Traceback (most recent call last):
File "C:\Users\user\Desktop\BCQ-master\BCQ-master\discrete_BCQ\main.py", line 297, in
train_BCQ(env, replay_buffer, is_atari, num_actions, state_dim, device, args, parameters)
File "C:\Users\user\Desktop\BCQ-master\BCQ-master\discrete_BCQ\main.py", line 144, in train_BCQ
replay_buffer.load(f"./buffers/{buffer_name}")
File "C:\Users\user\Desktop\BCQ-master\BCQ-master\discrete_BCQ\utils.py", line 110, in load
reward_buffer = np.load(f"{save_folder}_reward.npy")
File "D:\anconda\envs\pytorch\lib\site-packages\numpy\lib\npyio.py", line 405, in load
fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: './buffers/Default_PongNoFrameskip-v0_0_reward.npy'
Process finished with exit code 1
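In case it helps, that buffer file is only written after the earlier phases have run, so the three stages have to be executed in order. A typical sequence (flag names taken from the repo's discrete_BCQ/main.py; the environment name shown is the default and assumed here):

```shell
# 1. Train the behavioral DQN (saves the policy used to collect data)
python main.py --train_behavioral --env PongNoFrameskip-v0
# 2. Generate and save the replay buffer under ./buffers/
python main.py --generate_buffer --env PongNoFrameskip-v0
# 3. Train BCQ from the saved buffer (the default mode)
python main.py --env PongNoFrameskip-v0
```

Running step 3 before step 2 produces exactly this FileNotFoundError, since the *_reward.npy file does not exist yet.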
Hi,
Thanks for sharing the source code. I am trying to run your code on my own environment: I have my {action, state, next_state, reward, not_done, ptr} npy files all prepared in the ./buffer folder. When I start training BCQ, however, I see that train_BCQ in main.py calls eval_policy inside the "while training_iters < args.max_timesteps:" loop. I thought batch RL does not interact with the environment; if so, how do I implement env.step for my environment?
imt = (imt/imt.max(1, keepdim=True)[0] > self.threshold).float()
should be
imt = (imt/imt.max(1, keepdim=True) > self.threshold).float()
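For reference, PyTorch's Tensor.max(dim, keepdim=True) returns a (values, indices) pair rather than a single tensor, so the [0] in the original line selects the values tensor; without it the division would fail. A quick check with made-up probabilities:

```python
import torch

x = torch.tensor([[0.2, 0.8],
                  [0.9, 0.1]])

# max over a dim returns a (values, indices) namedtuple; [0] takes the values.
values = x.max(1, keepdim=True)[0]   # [[0.8], [0.9]]
ratios = x / values
mask = (ratios > 0.3).float()
print(mask)                          # [[0., 1.], [1., 0.]]
```

So the indexing in the repo's line appears intentional rather than a typo.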
Hi,
I am very interested in your paper, especially for the discrete one. However, I found it hard to train an Atari game on my laptop. Instead of Atari, I want to try some naive states other than pixels. I noticed that you implemented the fully connected layers for non-Atari environments. Could you please give some advice on the environment?