
reinforcementzerotoall's Introduction

Reinforcement Zero to All

This is a work in progress and may have bugs, but we welcome your comments and pull requests.

We emphasize the following:

  • Readability over anything else
    • That's why we choose Python
  • Pythonic code
    • PEP8
    • Docstring
  • Use the high-level TensorFlow API
    • Cleaner and easier to understand
  • KISS (Keep It Simple, Stupid)

Lecture videos

File naming rule

99_9_description.py
  • The first two digits indicate a category of algorithms
    • 07: DQN
    • 08: Policy Gradient
    • 09: Random Search Methods
    • 10: Actor Critic
  • The digit after the underscore is an ID within that category
  • The description shows what the file is about (see the example below)
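For example, 07_3_dqn_2015_cartpole.py is a DQN implementation (category 07) with ID 3, and its description says it runs the 2015-style DQN on CartPole.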

How to use uploader

It makes the uploading process a little simpler.

  1. Go to https://gym.openai.com/
  2. Login with your github account
  3. Copy your OpenAI API key from the upper right corner of your profile page
  4. Modify gym.ini with your API key (see the sketch below for what the uploader roughly does)
  5. In the console:
#python gym_uploader.py /path/to/gym_results
python gym_uploader.py gym-results/
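For reference, here is a rough sketch of what an uploader script like this typically does with the old Gym API. The gym.ini section and key names below are assumptions (check the file shipped with the repo); gym.upload() was part of the Gym API at the time this repo was written.

import sys
import configparser

import gym

# Read the API key from gym.ini; the [gym]/api_key layout is an assumption.
config = configparser.ConfigParser()
config.read("gym.ini")
api_key = config["gym"]["api_key"]

# Upload the recorded results directory (e.g. gym-results/) to the scoreboard.
gym.upload(sys.argv[1], api_key=api_key)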

Install requirements

pip install -r requirements.txt

Run test and autopep8

TODO: Need to add more test cases

pytest
# pip install autopep8  # if you haven't installed it
autopep8 . --recursive --in-place --pep8-passes 2000 --verbose --ignore E501
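As a starting point for more test cases, here is a minimal pytest sketch that pins down the discounted-return behavior discussed in the issues below. The commented-out import is hypothetical; in practice the test should import whichever discount helper the repo actually exposes.

import numpy as np
# from some_module import discount_correct_rewards  # hypothetical import path

def discounted_returns(r, gamma):
    """Un-normalized discounted returns: G_t = sum_k gamma**k * r[t + k]."""
    out, running = np.zeros_like(r, dtype=np.float64), 0.0
    for t in reversed(range(len(r))):
        running = running * gamma + r[t]
        out[t] = running
    return out

def test_discounted_returns_sum_future_rewards():
    gamma = 0.99
    expected = [1 + gamma + gamma ** 2, 1 + gamma, 1.0]
    np.testing.assert_allclose(discounted_returns(np.ones(3), gamma), expected)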

Contributions/Comments

We always welcome your comments and pull requests.

reinforcementzerotoall's People

Contributors

hunkim, imcomking, jerry4897, jinyul80, kkweon, zeran4


reinforcementzerotoall's Issues

DQN implementations should be updated

Summary

  • DQN implementations need to be updated/modified
  • I'm going to leave this issue as a future reference because I am not going to work on it now
  • Anyone is welcome to contribute

Problem

With the max step set to 200, the following DQN implementations won't clear CartPole-v0:

  • 07_1_q_net_cartpole.py
  • 07_2_dqn_2013_cartpole.py
  • 07_3_dqn_2015_cartpole.py

[CartPole-v0 clear condition: average reward >= 195 over 100 games]

Possible fix

  • Currently the code runs 50 update iterations every 10 episodes, which can be a problem: the initial policy is bad, and the network ends up fitting to that bad policy.
if episode % 10 == 1:  # train every 10 episodes
    # Get a random batch of experiences.
    for _ in range(50):
        minibatch = random.sample(replay_buffer, 10)
        loss, _ = ddqn_replay_train(mainDQN, targetDQN, minibatch)
  • Note that DQN is actually an online algorithm: in the original paper (Mnih et al., 2013), the network is updated once per environment step (see Algorithm 1 in the paper).
  • Add a target network as suggested in the follow-up Nature paper (Mnih et al., 2015), though CartPole should also be clearable without one. A per-step update schedule is sketched below.
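A possible shape for the fix, using the same helper names that appear in the snippet above (replay_buffer, ddqn_replay_train, mainDQN, targetDQN); this is only a sketch of a per-step update schedule, not the repo's actual code, and step_count, sess, and copy_ops are assumed to exist.

# Inside the per-step loop: train once per environment step instead of
# running 50 updates every 10 episodes.
BATCH_SIZE = 64
TARGET_UPDATE_FREQ = 100  # steps between target-network syncs

if len(replay_buffer) >= BATCH_SIZE:
    minibatch = random.sample(replay_buffer, BATCH_SIZE)
    loss, _ = ddqn_replay_train(mainDQN, targetDQN, minibatch)

if step_count % TARGET_UPDATE_FREQ == 0:
    sess.run(copy_ops)  # copy main-network weights into the target network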

Need to Improve Discounted Reward

Issue

I noticed that the current discounted-reward function does not sum up the future rewards.
I'm not sure whether this is intended, but even if it is, the policy gradient will not behave as intended because it will focus on learning only the very first action of each episode.

Recall that the discounted reward function is

G_t = \sum_{k=0}^{T-t} \gamma^{k} r_{t+k}

Example

  • r = [r0, r1, r2] = [1, 1, 1]
  • gamma = 0.99
  • expected discounted r = [1 + 0.99 + 0.99 ** 2, 1 + 0.99, 1]
  • current discounted r implementation = [1, 0.99, 0.99**2]

Implementation in this repo

def discount_rewards(r, gamma=0.99):
    """Takes 1d float array of rewards and computes discounted reward
    e.g. f([1, 1, 1], 0.99) -> [1, 0.99, 0.9801] -> [1.22 -0.004 -1.22]
    """
    d_rewards = np.array([val * (gamma ** i) for i, val in enumerate(r)])

    # Normalize/standardize rewards
    d_rewards -= d_rewards.mean()
    d_rewards /= d_rewards.std()
    return d_rewards

Correct Implementation (from Karpathy's code)

def discount_correct_rewards(r, gamma=0.99):
  """ take 1D float array of rewards and compute discounted reward """
  discounted_r = np.zeros_like(r)
  running_add = 0
  for t in reversed(range(0, r.size)):
    #if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (pong specific!)
    running_add = running_add * gamma + r[t]
    discounted_r[t] = running_add

  discounted_r -= discounted_r.mean()
  discounted_r /= discounted_r.std()
  return discounted_r

With LaTeX

Therefore, the above function should change so that it computes

G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k

for each step t, rather than \gamma^t r_t.

Question about 10_1_Actor_Critic.ipynb

In create_op() of the ActorCriticNetwork class, the loss is composed of policy_gain, value_loss, and entropy.
Could you explain what theory these terms are based on, which reference you used when writing them, and how each value is calculated?
(In particular, I don't quite understand why each term's formula is the way it is.)

Thank you.
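For reference, these three terms correspond to the standard advantage actor-critic (A2C/A3C) objective (Mnih et al., 2016). Below is a minimal TF1-style sketch of how such a loss is commonly composed; the tensor names policy, value, action_onehot, and td_target, as well as the coefficients, are assumptions and not copied from the notebook.

# Sketch of a typical A2C loss; `policy` is the actor's softmax output and
# `value` the critic's state-value estimate (both assumed tensors).
log_pi = tf.log(tf.reduce_sum(policy * action_onehot, axis=1) + 1e-8)
advantage = tf.stop_gradient(td_target - value)  # critic is not updated through this term

policy_gain = tf.reduce_mean(log_pi * advantage)                      # policy-gradient term
value_loss = tf.reduce_mean(tf.squared_difference(value, td_target))  # critic regression loss
entropy = -tf.reduce_mean(tf.reduce_sum(policy * tf.log(policy + 1e-8), axis=1))  # exploration bonus

loss = -policy_gain + 0.5 * value_loss - 0.01 * entropy  # typical coefficients, assumed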

Question about the image reshape in 08_4_softmax_pg_pong.py

08_4_softmax_pg_pong.py contains the following code:

X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
x_image = tf.reshape(X, [-1, 80, 80, 4])

The array fed to this placeholder is built by stacking four (80x80) images into shape (4, 6400) and then flattening it into shape (25600,).
Reshaping that into (80, 80, 4) scrambles the images and destroys their spatial structure.

X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
x_image = tf.transpose(tf.reshape(X, [-1, 4, 80, 80]),[0,2,3,1])

Shouldn't it be done like this?
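A quick NumPy check of the difference (frame count and sizes taken from the snippet above; the array contents are random and only illustrative):

import numpy as np

# Four 80x80 frames stacked to (4, 6400) and then flattened, as described above.
frames = np.random.rand(4, 80, 80)
flat = frames.reshape(4, 6400).ravel()               # shape (25600,)

wrong = flat.reshape(80, 80, 4)                      # mixes pixels across frames
right = flat.reshape(4, 80, 80).transpose(1, 2, 0)   # frames intact, channels last

print(np.allclose(right[..., 0], frames[0]))         # True
print(np.allclose(wrong[..., 0], frames[0]))         # False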

08_4_softmax_pg_pong_y.py ---> model restore BUG

agent = Agent(new_HW + [repeat], output_dim=action_dim, logdir='logdir/train',
                      checkpoint_dir="checkpoints")
init = tf.global_variables_initializer()
sess.run(init)

The Agent constructor restores the model, but the initializer below it then runs and overwrites the restored variables.
The order should be reversed:

init = tf.global_variables_initializer()
sess.run(init)
agent = Agent(new_HW + [repeat], output_dim=action_dim, logdir='logdir/train',
                      checkpoint_dir="checkpoints")

Bug?

In 10_1_Actor_Critic.ipynb, the code reads

policy_gain = tf.reduce_sum(policy_gain, name="policy_gain")

It seems this should be changed as follows, so that the magnitude of the policy gain does not grow with the number of steps in the batch:

policy_gain = tf.reduce_mean(policy_gain, name="policy_gain")
