
bcq's Issues

Did you test your discrete-BCQ code before?

Following your settings, I just can't train a behavioral agent because the DQN model doesn't converge. I want to know whether I made a mistake or whether there is something wrong in your code. Can you help me?

Can the original BCQ paper be used with a discrete action space?

Hi,

I've been reading the original and discretised BCQ papers and wanted to ask whether the original BCQ algorithm can be applied to an environment with discrete actions. I may be misinterpreting sections of both papers, but to my understanding the original BCQ algorithm can handle both continuous and discrete actions, whereas the discretised BCQ algorithm focuses solely on discrete actions. I am also curious about the VAE in the original paper and wonder whether it can be used in a discrete setting.

Also, I wanted to ask whether the "i1/i2" layers in the discretised BCQ network represent the generative model. I'm having some difficulty relating the "i" layers of the BCQ network to Algorithm 1 in the discrete BCQ paper.

I appreciate any help with this query!
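
For readers puzzling over the same point, here is a rough, hedged sketch of how the action constraint in Algorithm 1 of the discrete BCQ paper can be read: the generative model G(a|s) is a behaviour-cloning head (plausibly the "i" layers), and actions whose probability relative to the most likely action falls below a threshold tau are masked out before taking the argmax over Q-values. The function and variable names below are illustrative, not the repo's.

import torch.nn.functional as F

def constrained_greedy_action(q_values, imitation_logits, tau=0.3):
    # q_values, imitation_logits: tensors of shape (batch, num_actions)
    probs = F.softmax(imitation_logits, dim=1)
    # keep only actions whose probability is within a factor tau of the
    # most likely action under the behaviour-cloning model
    mask = (probs / probs.max(1, keepdim=True)[0] >= tau).float()
    # push masked-out actions to a very low value so argmax never picks them
    masked_q = mask * q_values + (1.0 - mask) * (-1e8)
    return masked_q.argmax(1)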

Was the "done" condition used incorrectly in the discrete action branch?

First of all, thanks for sharing the source code.

My question is whether the "done" condition was used incorrectly in the discrete action branch?

target_Q = reward + done * self.discount * q.gather(1, next_action).reshape(-1, 1)

I guess it should be
target_Q = reward + (1 - done) * self.discount * q.gather(1, next_action).reshape(-1, 1)
if the flag really is "done" rather than "not_done".

Looking forward to your reply.
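
For comparison, here is a minimal sketch of the usual one-step target with explicit termination masking, assuming the buffer stores a done flag equal to 1 at episode ends. (If the buffer instead stores not_done = 1 - done, multiplying by that flag directly is equivalent to the quoted line.) Variable names are illustrative.

def bellman_target(reward, done, discount, next_q, next_action):
    # reward, done: (batch, 1) tensors; next_q: (batch, num_actions); next_action: (batch, 1) long tensor
    # (1 - done) zeroes out the bootstrap term on terminal transitions
    return reward + (1.0 - done) * discount * next_q.gather(1, next_action).reshape(-1, 1)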

Some questions about the experiments for demonstrating extrapolation error.

Hello, I am currently studying offline reinforcement learning and came across BCQ. It's great work and worth delving into. However, I have some questions about the paper that I'd like to clarify to make sure I haven't misunderstood anything. My questions may be numerous, but I genuinely want to understand the experimental details.

Here are my questions:

  1. In Figure 1, does "Off-policy DDPG" refer to DDPG trained on a fixed dataset without interacting with the environment? And, as the benchmark for comparison, does "Behavioral" refer to DDPG trained with the normal online training process?

  2. In Figure 1, for the three experiments with different buffers, should "Final" be understood as training a Behavioral DDPG, recording transitions during its training, and then using the final buffer as the dataset for Off-Policy DDPG training (with no new transitions added to the buffer during that training)? Can "Concurrent" be understood as Off-Policy DDPG gradually seeing transitions from the early stage to the late stage as training proceeds, rather than being able to sample late-stage transitions right from the beginning? (A rough sketch of this reading appears after these questions.)

  3. In Figure 1, do the orange horizontal lines in (a) and (c) represent the average episode return of the Behavioral agent, computed after the complete buffer has been collected (i.e., after Behavioral training concludes)? Is that also why there is no such line in (b) (because the buffer is still being collected there)?

  4. Based on the experiments in Figure 1, can it be understood that
    (1) even if offline RL uses a dataset with sufficient coverage, extrapolation error (caused by the actor in DDPG selecting out-of-distribution actions) still leads to suboptimal performance;
    (2) even if offline RL uses the same buffer as Behavioral, the transitions in the buffer were still not generated by the offline agent itself, so a distribution-shift issue remains;
    (3) even if offline RL is trained on expert or near-expert data, without encountering "bad" (early-stage) data it may fail to learn which actions should be avoided, making its performance worse than with the final and concurrent buffers?
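
For what it's worth, here is the rough sketch referenced in question 2 of how the "final buffer" setting can be read: a behavioral DDPG agent is trained online while every transition it experiences is logged, and a second, purely off-policy DDPG agent is then trained from that frozen buffer with no further environment interaction ("concurrent" instead trains the off-policy agent alongside the behavioral one on the growing buffer). All names below are illustrative, not the paper's code.

def final_buffer_experiment(env, behavioral_agent, offline_agent, buffer, steps=1_000_000):
    # Phase 1: behavioral DDPG interacts with the environment and fills the buffer.
    state, done = env.reset(), False
    for _ in range(steps):
        action = behavioral_agent.select_action(state)        # with exploration noise
        next_state, reward, done, _ = env.step(action)
        buffer.add(state, action, next_state, reward, done)   # every transition is kept
        behavioral_agent.train(buffer)                        # ordinary online training
        state = env.reset() if done else next_state

    # Phase 2: off-policy DDPG trains only from the frozen buffer (no env.step here).
    for _ in range(steps):
        offline_agent.train(buffer)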

Buffer size

Why is the default buffer size 10% of the paper's?

Can you please post software packages configuration?

Thank you for the clean codebase!

I'm trying to reproduce the results in the paper, but I'm unable to obtain an expert with good performance while running the code as is.

I suspect it's because of a software version mismatch. Could you please post the software package configuration that you used to run the code?

Thank you!

The packages in my conda environment are listed below. With these package versions, I was only able to obtain a policy with an episode return of approximately 230 on Hopper-v1, which is quite low.

conda_packages.pdf

A potential bug in discrete BCQ FC network

Hello, first of all, thanks for this great work!
I found an issue with this line:

i = F.relu(self.i3(i))

I don't think the final imitation logits should be passed through a ReLU layer. If I remove this ReLU, the code works on toy examples in the discrete setting.
Also, the convolutional network does not apply a ReLU to its final imitation logits, so I think this one should be removed as well. Please correct me if I'm wrong.
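
For concreteness, a minimal sketch of a fully connected Q-network with a separate imitation head, with the final ReLU dropped so the head emits raw logits. The layer names follow the i1/i2/i3 naming used in the issue; the exact architecture in the repo may differ.

import torch.nn as nn
import torch.nn.functional as F

class FC_Q(nn.Module):
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.q1 = nn.Linear(state_dim, 256)
        self.q2 = nn.Linear(256, 256)
        self.q3 = nn.Linear(256, num_actions)

        self.i1 = nn.Linear(state_dim, 256)
        self.i2 = nn.Linear(256, 256)
        self.i3 = nn.Linear(256, num_actions)

    def forward(self, state):
        q = F.relu(self.q1(state))
        q = F.relu(self.q2(q))

        i = F.relu(self.i1(state))
        i = F.relu(self.i2(i))
        i = self.i3(i)                      # raw logits: no ReLU on the final layer
        return self.q3(q), F.log_softmax(i, dim=1), i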

Performance of DDPG and BCQ

I am trying to reproduce the results for the continuous environments, but my results are poor. Could you please give more details about the expected results? For example, what result should we get when running "python main.py --train_behavioral --gaussian_std 0.1"?

A problem


Traceback (most recent call last):
  File "C:\Users\user\Desktop\BCQ-master\BCQ-master\discrete_BCQ\main.py", line 297, in <module>
    train_BCQ(env, replay_buffer, is_atari, num_actions, state_dim, device, args, parameters)
  File "C:\Users\user\Desktop\BCQ-master\BCQ-master\discrete_BCQ\main.py", line 144, in train_BCQ
    replay_buffer.load(f"./buffers/{buffer_name}")
  File "C:\Users\user\Desktop\BCQ-master\BCQ-master\discrete_BCQ\utils.py", line 110, in load
    reward_buffer = np.load(f"{save_folder}_reward.npy")
  File "D:\anconda\envs\pytorch\lib\site-packages\numpy\lib\npyio.py", line 405, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: './buffers/Default_PongNoFrameskip-v0_0_reward.npy'

Process finished with exit code 1

Training BCQ while evaluating policy with environment?

Hi,
Thanks for sharing the source code. I am trying to run your code on my own environment: I have my {action, state, next_state, reward, not_done, ptr} npy files prepared in the ./buffer folder. When I start training BCQ, however, I see that train_BCQ in main.py calls eval_policy inside the "while training_iters < args.max_timesteps:" loop. I thought batch RL does not interact with the environment; if that's the case, how do I provide env.step for my environment?
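
If it helps, a hedged sketch of what the training loop can look like when no simulator exists: BCQ's updates themselves only sample from the replay buffer, so the env-based eval_policy call can simply be dropped (or replaced by an offline proxy metric). The function below is illustrative and assumes the policy exposes a train(replay_buffer, iterations) method; double-check against the repo.

def train_bcq_offline(policy, replay_buffer, max_timesteps, eval_freq):
    # Train BCQ purely from the fixed buffer: no env.reset() / env.step() anywhere.
    training_iters = 0
    while training_iters < max_timesteps:
        policy.train(replay_buffer, iterations=int(eval_freq))
        training_iters += eval_freq
        # Report progress without rolling out in an environment
        # (this is where main.py would otherwise call eval_policy).
        print(f"Training iterations: {training_iters}")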

bug?

imt = (imt/imt.max(1, keepdim=True)[0] > self.threshold).float()

should be

imt = (imt/imt.max(1, keepdim=True) > self.threshold).float()
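
For context, a small self-contained check of how torch.max behaves with a dim argument: it returns a (values, indices) pair, so the [0] in the quoted line selects the per-row maximum values.

import torch

imt = torch.tensor([[0.2, 0.5, 0.3],
                    [0.1, 0.1, 0.8]])

values, indices = imt.max(1, keepdim=True)    # (values, indices) named tuple
print(values)    # tensor([[0.5000], [0.8000]])
print(indices)   # tensor([[1], [2]])

mask = (imt / values > 0.3).float()           # the masking pattern from the quoted line
print(mask)      # tensor([[1., 1., 1.], [0., 0., 1.]])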

Discrete Environment other than Atari

Hi,
I am very interested in your paper, especially the discrete version. However, I find it hard to train an Atari game on my laptop. Instead of Atari, I want to try some environments with simple (non-pixel) states. I noticed that you implemented fully connected layers for non-Atari environments. Could you please give some advice on choosing such an environment? (A hedged example follows.)
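
As a starting point, a low-dimensional classic-control task such as CartPole-v0 is a common choice, and the fully connected network can be sized directly from the gym spaces. This is only a sketch: FC_Q refers to the network sketched earlier in this section, and its constructor arguments are assumptions.

import gym

env = gym.make("CartPole-v0")
state_dim = env.observation_space.shape[0]   # 4 for CartPole
num_actions = env.action_space.n             # 2 for CartPole

q_net = FC_Q(state_dim, num_actions)         # FC_Q from the sketch earlier in this section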
