Hi, I am trying to create an environment that is a variation of CartPole.
Starting from the CartPole definition: suppose you can apply a force F, but also a multiplier M of this force, so the total force applied is F * M.
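To make the encoding concrete, here is a minimal sketch (my reading of the step() code below) of how a MultiDiscrete([2, 4]) action maps to the total force; total_force is a hypothetical helper for illustration, not part of the training code:

def total_force(action, force_mag=5.0):
    # action[0] in {0, 1} picks the direction; action[1] in {0, ..., 3} gives M = action[1] + 1
    direction = 1.0 if action[0] == 1 else -1.0
    return direction * force_mag * (action[1] + 1)

assert total_force([1, 0]) == 5.0    # +F * 1
assert total_force([0, 3]) == -20.0  # -F * 4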
# PPO-LSTM
import math

import gym
import gym.envs.classic_control
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from gym import logger
from torch.distributions import Categorical
# Hyperparameters
learning_rate = 0.0005
gamma = 0.98
lmbda = 0.95
eps_clip = 0.1
K_epoch = 2
T_horizon = 20
class CustomCartpole(gym.envs.classic_control.CartPoleEnv):
    """Add a dimension to the cartpole action space that is used as a 'speed' button."""

    def __init__(self, env_config):
        super().__init__()
        self.force_mag = 5.0  # base force F; the multiplier scales it up to 4x
        # action = [direction (2 values), force-multiplier index (4 values)]
        self.action_space = gym.spaces.MultiDiscrete([2, 4])
def step(self, action):
err_msg = "%r (%s) invalid" % (action, type(action))
assert self.action_space.contains(action), err_msg
x, x_dot, theta, theta_dot = self.state
        force = self.force_mag if action[0] == 1 else -self.force_mag
        force *= (action[1] + 1)  # multiplier M in {1, 2, 3, 4}
costheta = math.cos(theta)
sintheta = math.sin(theta)
temp = (force + self.polemass_length * theta_dot ** 2 * sintheta) / self.total_mass
thetaacc = (self.gravity * sintheta - costheta * temp) / (self.length * (4.0 / 3.0 - self.masspole * costheta ** 2 / self.total_mass))
xacc = temp - self.polemass_length * thetaacc * costheta / self.total_mass
if self.kinematics_integrator == 'euler':
x = x + self.tau * x_dot
x_dot = x_dot + self.tau * xacc
theta = theta + self.tau * theta_dot
theta_dot = theta_dot + self.tau * thetaacc
else: # semi-implicit euler
x_dot = x_dot + self.tau * xacc
x = x + self.tau * x_dot
theta_dot = theta_dot + self.tau * thetaacc
theta = theta + self.tau * theta_dot
self.state = (x, x_dot, theta, theta_dot)
done = bool(
x < -self.x_threshold
or x > self.x_threshold
or theta < -self.theta_threshold_radians
or theta > self.theta_threshold_radians
)
if not done:
reward = 1.0
elif self.steps_beyond_done is None:
# Pole just fell!
self.steps_beyond_done = 0
reward = 1.0
else:
if self.steps_beyond_done == 0:
logger.warn(
"You are calling 'step()' even though this "
"environment has already returned done = True. You "
"should always call 'reset()' once you receive 'done = "
"True' -- any further steps are undefined behavior."
)
self.steps_beyond_done += 1
reward = 0.0
return np.array(self.state), reward, done, {}
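# Quick sanity check for the custom env (a sketch, not used by training):
# roll a few random MultiDiscrete actions through step(), assuming the old
# gym API where step() returns the 4-tuple (obs, reward, done, info).
def sanity_check_env(n_steps=10):
    env = CustomCartpole({})
    s = env.reset()
    for _ in range(n_steps):
        a = env.action_space.sample()  # e.g. np.array([1, 2])
        s, r, done, info = env.step(a)
        if done:
            s = env.reset()
    env.close()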
class PPO(nn.Module):
def __init__(self):
super(PPO, self).__init__()
self.data = []
self.fc1 = nn.Linear(4,64)
self.lstm = nn.LSTM(64,32)
        # 2 directions x 4 multipliers = 8 flat actions, one logit per pair
        self.fc_pi = nn.Linear(32, 8)
        self.fc_v = nn.Linear(32, 1)  # state value is a scalar
self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)
def pi(self, x, hidden):
x = F.relu(self.fc1(x))
x = x.view(-1, 1, 64)
x, lstm_hidden = self.lstm(x, hidden)
x = self.fc_pi(x)
prob = F.softmax(x, dim=2)
return prob, lstm_hidden
def v(self, x, hidden):
x = F.relu(self.fc1(x))
x = x.view(-1, 1, 64)
x, lstm_hidden = self.lstm(x, hidden)
v = self.fc_v(x)
return v
def put_data(self, transition):
self.data.append(transition)
def make_batch(self):
s_lst, a_lst, r_lst, s_prime_lst, prob_a_lst, h_in_lst, h_out_lst, done_lst = [], [], [], [], [], [], [], []
for transition in self.data:
s, a, r, s_prime, prob_a, h_in, h_out, done = transition
s_lst.append(s)
a_lst.append([a])
r_lst.append([r])
s_prime_lst.append(s_prime)
prob_a_lst.append([prob_a])
h_in_lst.append(h_in)
h_out_lst.append(h_out)
done_mask = 0 if done else 1
done_lst.append([done_mask])
s,a,r,s_prime,done_mask,prob_a = torch.tensor(s_lst, dtype=torch.float), torch.tensor(a_lst), \
torch.tensor(r_lst), torch.tensor(s_prime_lst, dtype=torch.float), \
torch.tensor(done_lst, dtype=torch.float), torch.tensor(prob_a_lst)
self.data = []
return s,a,r,s_prime, done_mask, prob_a, h_in_lst[0], h_out_lst[0]
def train_net(self):
s,a,r,s_prime,done_mask, prob_a, (h1_in, h2_in), (h1_out, h2_out) = self.make_batch()
first_hidden = (h1_in.detach(), h2_in.detach())
second_hidden = (h1_out.detach(), h2_out.detach())
for i in range(K_epoch):
v_prime = self.v(s_prime, second_hidden).squeeze(1)
td_target = r + gamma * v_prime * done_mask
v_s = self.v(s, first_hidden).squeeze(1)
delta = td_target - v_s
delta = delta.detach().numpy()
advantage_lst = []
advantage = 0.0
for item in delta[::-1]:
advantage = gamma * lmbda * advantage + item[0]
advantage_lst.append([advantage])
advantage_lst.reverse()
advantage = torch.tensor(advantage_lst, dtype=torch.float)
pi, _ = self.pi(s, first_hidden)
pi_a = pi.squeeze(1).gather(1,a)
            ratio = torch.exp(torch.log(pi_a) - torch.log(prob_a))  # a/b == exp(log(a) - log(b))
surr1 = ratio * advantage
surr2 = torch.clamp(ratio, 1-eps_clip, 1+eps_clip) * advantage
loss = -torch.min(surr1, surr2) + F.smooth_l1_loss(v_s, td_target.detach())
self.optimizer.zero_grad()
loss.mean().backward(retain_graph=True)
self.optimizer.step()
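# Note on the action space: instead of one head per MultiDiscrete dimension,
# the policy above uses a single 8-way head over all (direction, multiplier)
# pairs, so a plain Categorical can be sampled. A flat index a decodes back
# to the MultiDiscrete pair as [a // 4, a % 4] -- this illustrative helper is
# equivalent to what main() does inline:
def decode_action(a):
    """Flat index in [0, 8) -> [direction in {0, 1}, multiplier index in {0..3}]."""
    return np.array([a // 4, a % 4])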
def main():
#env = gym.make('CartPole-v1')
env = CustomCartpole({'override_actions': False})
model = PPO()
score = 0.0
print_interval = 20
for n_epi in range(10000):
h_out = (torch.zeros([1, 1, 32], dtype=torch.float), torch.zeros([1, 1, 32], dtype=torch.float))
s = env.reset()
done = False
while not done:
for t in range(T_horizon):
h_in = h_out
prob, h_out = model.pi(torch.from_numpy(s).float(), h_in)
prob = prob.view(-1)
m = Categorical(prob)
                a = m.sample().item()  # flat action index in [0, 8)
                # decode the flat index into the MultiDiscrete pair [direction, multiplier]
                s_prime, r, done, info = env.step(np.array([a // 4, a % 4]))
model.put_data((s, a, r/100.0, s_prime, prob[a].item(), h_in, h_out, done))
s = s_prime
score += r
if done:
break
model.train_net()
if n_epi%print_interval==0 and n_epi!=0:
print("# of episode :{}, avg score : {:.1f}".format(n_epi, score/print_interval))
score = 0.0
env.close()
if __name__ == '__main__':
main()
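If you would rather keep the MultiDiscrete([2, 4]) structure explicit, an alternative is a factored policy: one softmax head per action dimension, sampled independently, with the joint log-probability being the sum of the two factors (that sum is what the PPO ratio would use). A minimal sketch, assuming the same 32-dim LSTM features as above; FactoredPi is hypothetical and not wired into the code:

class FactoredPi(nn.Module):
    """One Categorical head per MultiDiscrete dimension."""
    def __init__(self, in_dim=32):
        super().__init__()
        self.fc_dir = nn.Linear(in_dim, 2)  # push left / push right
        self.fc_mag = nn.Linear(in_dim, 4)  # force multiplier 1..4

    def forward(self, x):
        return Categorical(logits=self.fc_dir(x)), Categorical(logits=self.fc_mag(x))

# Usage sketch:
# dist_dir, dist_mag = FactoredPi()(lstm_features)
# a_dir, a_mag = dist_dir.sample(), dist_mag.sample()
# action = np.array([a_dir.item(), a_mag.item()])
# log_prob = dist_dir.log_prob(a_dir) + dist_mag.log_prob(a_mag)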