
Comments (10)

takuseno commented on June 10, 2024

Which dataset are you using? Because let's say the maximum capacity of the replay buffer is 1M and the dataset size is below 1M. In that case, there is still room to store new experiences in the replay buffer. If n_steps of fit_online is very large such as 1M, do you still see the same result?
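
The gist in plain Python (using collections.deque as a stand-in for the replay buffer, so this is only an illustration, not d3rlpy's implementation): the first element is only evicted once the queue is actually full.

from collections import deque

# Toy FIFO queue standing in for the replay buffer (illustration only).
buffer = deque(maxlen=5)

# Pre-fill with 3 "offline" transitions -- the buffer is not full yet.
for i in range(3):
    buffer.append(f"offline-{i}")

# Add 2 "online" transitions: there is still room, so nothing is evicted
# and the first element stays the same.
for i in range(2):
    buffer.append(f"online-{i}")
print(buffer[0])  # offline-0

# Only once the buffer is full does a new append evict the oldest element.
buffer.append("online-2")
print(buffer[0])  # offline-1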


takuseno commented on June 10, 2024

I couldn't actually reproduce your problem on my end. Here is my test code:

import d3rlpy

dqn = d3rlpy.algos.DQN(batch_size=10)
dataset, env = d3rlpy.datasets.get_cartpole()
# Small buffer pre-filled with the offline episodes, so eviction starts right away.
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=100, env=env, episodes=dataset.episodes)
print(buffer.transitions[0].observation)
dqn.fit_online(env, buffer, n_steps=100)
# After online training, the first stored observation has been replaced.
print(buffer.transitions[0].observation)

Note that I'm using gym==0.17.0 due to compatibility issues, but this does not affect the behavior of ReplayBuffer. The result is:

[0.37582362 0.5620217  0.00999125 0.24562645]
[-0.00132633 -0.599253   -0.00662717  0.88046914]

Can you share a complete example so that I can reproduce your problem?


HYDesmondLiu commented on June 10, 2024

Thanks for the instructions and replies.
I have created a simplified example snippet, and now it works.
There must have been something wrong in my previous code.

import d3rlpy

# Offline pretraining, then save and restore the policy.
dataset, env = d3rlpy.datasets.get_dataset("hopper-expert-v2")
policy = d3rlpy.algos.CQL(use_gpu=True)
policy.fit(dataset, n_steps=100, n_steps_per_epoch=50, show_progress=False)
policy.save_model('CQL.pt')

pretrained_model_name = 'CQL.pt'
policy.build_with_env(env)
policy.load_model(pretrained_model_name)

# Tiny buffer pre-filled with the dataset's episodes, so eviction starts immediately.
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=20, env=env, episodes=dataset.episodes)

for _ in range(2):
    policy.fit_online(env, buffer, n_steps=100, n_steps_per_epoch=50, show_progress=False)
    t0 = buffer.transitions[0]
    print(f'first obs.:{t0.observation}')


first obs.:[ 1.2099096  -0.04010323 -0.07108203 -0.003647    0.16283353  0.3001447
 -0.03584122 -0.22062637 -0.29892066 -0.403349   -0.8289119 ]
first obs.:[ 1.2481172  -0.01057371 -0.00299744 -0.01796114 -0.02033593  0.07686285
 -0.2977234   0.9424554   0.8758408   0.48696142 -0.93829495]  


takuseno commented on June 10, 2024

@HYDesmondLiu Hi, thanks for the issue. Please check this IQL reproduction script.

buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=1000000,


HYDesmondLiu commented on June 10, 2024

Thanks for the prompt reply.
Let me ask this way: since the buffer is FIFO, I have tried to fill it with the dataset's transitions.
With CQL.fit_online, the first transition is not changed after fine-tuning via interaction with the environment.

My workaround is to enlarge the maxlen of the buffer; I believe there is a bug in the append code block.


takuseno commented on June 10, 2024

Can you make sure you set episodes=dataset.episodes like in the example above?


HYDesmondLiu commented on June 10, 2024

I have set it, and it does not work either.
Since it is a queue, the first element should be removed when we try to add a new element to a full buffer.
However, when I use buffer.transitions[0] to print out the first observation in the buffer, it is always the same.

Basically, I was doing the training as in the following snippet. Please let me know if anything is set incorrectly that
leads to the buffer not being updated after online training.

dataset, env = d3rlpy.datasets.get_dataset(env_name)
policy = d3rlpy.algos.CQL(use_gpu=True)
pretrained_model_name = 'xxx.pt'
policy.build_with_env(env)
policy.load_model(pretrained_model_name)
# Pre-fill the buffer with the dataset's episodes before online fine-tuning.
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=1_000_000, env=env, episodes=dataset.episodes)
policy.fit_online(env, buffer, n_steps=250_000)
t0 = buffer.transitions[0]
tl = buffer.transitions[-1]
print(f'first obs.:{t0.observation}/last obs.:{tl.observation}')


HYDesmondLiu commented on June 10, 2024

I am using MuJoCo; however, I deliberately downsample the dataset.

In real-world offline-to-online fine-tuning problems there might not be access to the whole offline dataset, so I downsample the buffer to only 5% of the original amount. As a result, the buffers fill up really quickly.
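
Roughly what I mean by downsampling, as a sketch (the episode-level 5% cut here is only illustrative; my actual sampling scheme may differ):

import d3rlpy

dataset, env = d3rlpy.datasets.get_dataset("hopper-expert-v2")

# Keep roughly 5% of the episodes before pre-filling the buffer
# (illustrative cut; the actual sampling may differ).
n_keep = max(1, len(dataset.episodes) // 20)
small_episodes = dataset.episodes[:n_keep]

buffer = d3rlpy.online.buffers.ReplayBuffer(
    maxlen=1_000_000, env=env, episodes=small_episodes)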


takuseno commented on June 10, 2024

I see. Then I believe that the replay buffer still has space to store new experiences. You can test this by decreasing maxlen of the replay buffer to a very small value and seeing whether it works correctly.
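
A minimal way to run that check, following the snippets already posted in this thread (the dataset and step counts are only for illustration):

import d3rlpy

dataset, env = d3rlpy.datasets.get_dataset("hopper-expert-v2")
policy = d3rlpy.algos.CQL(use_gpu=True)

# Very small buffer, so FIFO eviction has to kick in almost immediately.
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=20, env=env, episodes=dataset.episodes)

before = buffer.transitions[0].observation
policy.fit_online(env, buffer, n_steps=100, n_steps_per_epoch=50, show_progress=False)
after = buffer.transitions[0].observation

# If eviction works, the first stored observation should have changed.
print(before)
print(after)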


HYDesmondLiu commented on June 10, 2024

Thanks for the quick response. That is actually what I did; the code snippet I provided is a modified version. The maxlen I used was only 5% of the original size of the buffer, and I printed the first observation after every fit_online call. It was always the same one.

