This repository is the official implementation of the paper *A Supervised Learning Framework for Batch Reinforcement Learning*, submitted to NeurIPS 2020.
- Python version: Python 3.6.8 :: Anaconda custom (64-bit)
- numpy == 1.18.1
- pandas == 1.0.3
- sklearn == 0.22.1
- tensorflow == 2.1.0
- pickle, os, sys, random (Python standard library)
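As a quick sanity check before running the scripts, the snippet below (not part of the repo, just a suggestion) prints the installed versions of the main dependencies against the pinned ones:

```python
# Optional environment check: compare installed vs. pinned versions.
import numpy, pandas, sklearn, tensorflow

for mod, want in [(numpy, "1.18.1"), (pandas, "1.0.3"),
                  (sklearn, "0.22.1"), (tensorflow, "2.1.0")]:
    print(f"{mod.__name__}: found {mod.__version__}, expected {want}")
```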
- `models`: network structures
- `agents`: DQN, MultiHeadDQN, QR-DQN agents
- `replay_buffers`: basic and prioritized replay buffers (a minimal illustrative sketch follows this list)
- `algos`: behavior cloning, density estimator, advantage learner, fitted Q evaluation, etc.
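For readers unfamiliar with the replay buffer component, here is a self-contained sketch of a basic (uniform-sampling) buffer; this is a generic illustration only, and the interface of the repo's `replay_buffers` module may differ:

```python
import random
from collections import deque

class BasicReplayBuffer:
    """Uniform-sampling replay buffer (illustrative sketch, not the repo's code)."""

    def __init__(self, capacity):
        # deque drops the oldest transition once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # convert to list for uniform sampling without replacement
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```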
- Run `python train_qr_dqn_agent.py &` in the `lunarlander-v2` folder: this trains a QR-DQN agent online in the Gym LunarLander-v2 environment and takes nearly three hours without GPU support.
- Copy the `trajs_qr_dqn.pkl` file produced by the first step under the `online` folder to the `dqn_2_200/random/` folder, and run `python batch_sale_random_dqn.py &` (around 20 hours without GPU support). This generates the DQN offline training results. The DDQN and QR-DQN results can be obtained in the same way, using either random trajectories or the first 200 trajectories (see the selection sketch after this list); our results are given in `lunarlander-v2/plot_figs`.
- Run `python plot_ckpts_avg_figs.py &` and `python plot_ckpts_last_figs.py &` to generate the figures in our paper.
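The two batch-data regimes mentioned above ("random" vs. "first 200 trajectories") can be sketched as follows. The file name comes from the steps above, but the selection logic itself is our assumption, not necessarily what `batch_sale_random_dqn.py` does internally:

```python
import pickle
import random

# Load the trajectories produced by online QR-DQN training.
with open("trajs_qr_dqn.pkl", "rb") as f:
    trajs = pickle.load(f)  # list of trajectories

first_200 = trajs[:200]                 # "first 200 trajectories" regime
random_200 = random.sample(trajs, 200)  # "random" regime (needs >= 200 trajs)
```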
- Run the scripts under `realdata` after putting the `trajs.pkl` file of real data in the `realdata/data` folder. `trajs.pkl` is a list of lists of transitions `(s, a, r, s', done)`, as in the sketch below.
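A minimal sketch of the expected `trajs.pkl` layout; the state dimension and the concrete values here are illustrative assumptions, only the `(s, a, r, s', done)` structure comes from the description above:

```python
import pickle
import numpy as np

# One trajectory: a list of (s, a, r, s', done) transitions.
traj = [
    (np.zeros(8), 0, 1.0, np.ones(8), False),
    (np.ones(8), 2, -0.5, np.zeros(8), True),
]
trajs = [traj]  # trajs.pkl stores a list of such trajectories

with open("realdata/data/trajs.pkl", "wb") as f:
    pickle.dump(trajs, f)
```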
For the complexity analysis, we assume that the cost of one forward and one backward pass of the network is `S`; a worked numerical example follows the list below.
- step 2: training `L` DQN agents, batch size `B_1`, training steps `I_1`, total `O(L * I_1 * B_1 * S)`
- step 3: training `L` density estimators, batch size `B_2`, training steps `I_2`, total `O(L * I_2 * B_2^4 * S)`
- step 4: pseudo Q computations, batch size `B_3`, total `O(B_3 * N * T * A * S)`, where `N` is the number of trajectories, `T` the average trajectory length, and `A` the number of actions
- step 5: training `tau`, batch size `B_4`, training steps `I_4`, total `O(I_4 * B_4 * S)`
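A worked example of these operation counts; every hyperparameter value below is a purely illustrative assumption, not a setting used in the paper:

```python
# Illustrative hyperparameters (assumptions), with S normalized to 1.
L, S = 10, 1                    # number of agents/estimators; per-pass cost
I1, B1 = 50_000, 32             # step 2: DQN training steps, batch size
I2, B2 = 10_000, 32             # step 3: density estimator steps, batch size
B3, N, T, A = 32, 200, 500, 4   # step 4: batch size, #trajs, avg length, #actions
I4, B4 = 50_000, 32             # step 5: tau training steps, batch size

print("step 2:", L * I1 * B1 * S)      # O(L * I_1 * B_1 * S)
print("step 3:", L * I2 * B2**4 * S)   # O(L * I_2 * B_2^4 * S)
print("step 4:", B3 * N * T * A * S)   # O(B_3 * N * T * A * S)
print("step 5:", I4 * B4 * S)          # O(I_4 * B_4 * S)
```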