
reinforcementzerotoall's Introduction

Reinforcement Zero to All

This is a work in progress and may have bugs, but we welcome your comments and pull requests.

We emphasize the following:

  • Readability over anything else
    • That's why we choose Python
  • Pythonic code
    • PEP8
    • Docstring
  • Use the high-level TensorFlow API
    • Cleaner and easier to understand
  • KISS (Keep It Simple, Stupid)

Lecture videos

File naming rule

99_9_description.py
  • The first two digits indicate a category of algorithms
    • 07: DQN
    • 08: Policy Gradient
    • 09: Random Search Methods
    • 10: Actor Critic
  • The digit after the underscore is an ID within that category
  • The description shows what the file is about (see the example below)
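For example, 07_3_dqn_2015_cartpole.py is a DQN implementation (category 07) with ID 3, and its description says it runs the 2015-style DQN on CartPole.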

How to use uploader

It makes the uploading process a little simpler.

  1. Go to https://gym.openai.com/
  2. Login with your github account
  3. Copy your OpenAI API key from the upper right corner of your profile page
  4. Modify gym.ini with your API key (see the sketch below for what the uploader roughly does)
  5. In the console:
#python gym_uploader.py /path/to/gym_results
python gym_uploader.py gym-results/
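For reference, here is a rough sketch of what an uploader script like this typically does with the old Gym API. The gym.ini section and key names below are assumptions (check the file shipped with the repo); gym.upload() was part of the Gym API at the time this repo was written.

import sys
import configparser

import gym

# Read the API key from gym.ini; the [gym]/api_key layout is an assumption.
config = configparser.ConfigParser()
config.read("gym.ini")
api_key = config["gym"]["api_key"]

# Upload the recorded results directory (e.g. gym-results/) to the scoreboard.
gym.upload(sys.argv[1], api_key=api_key)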

Install requirements

pip install -r requirements.txt

Run test and autopep8

TODO: Need to add more test cases

pytest
# pip install autopep8  # if you haven't installed it
autopep8 . --recursive --in-place --pep8-passes 2000 --verbose --ignore E501
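As a starting point for more test cases, here is a minimal pytest sketch that pins down the discounted-return behavior discussed in the issues below. The commented-out import is hypothetical; in practice the test should import whichever discount helper the repo actually exposes.

import numpy as np
# from some_module import discount_correct_rewards  # hypothetical import path

def discounted_returns(r, gamma):
    """Un-normalized discounted returns: G_t = sum_k gamma**k * r[t + k]."""
    out, running = np.zeros_like(r, dtype=np.float64), 0.0
    for t in reversed(range(len(r))):
        running = running * gamma + r[t]
        out[t] = running
    return out

def test_discounted_returns_sum_future_rewards():
    gamma = 0.99
    expected = [1 + gamma + gamma ** 2, 1 + gamma, 1.0]
    np.testing.assert_allclose(discounted_returns(np.ones(3), gamma), expected)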

Contributions/Comments

We always welcome your comments and pull requests.

reinforcementzerotoall's People

Contributors

hunkim, imcomking, jerry4897, jinyul80, kkweon, zeran4


reinforcementzerotoall's Issues

DQN implementations should be updated

Summary

  • DQN implementations need to be updated/modified
  • I'm going to leave this issue as a future reference because I am not going to work on it now
  • Anyone is welcome to contribute

Problem

With the max step set to 200, the following DQN implementations won't clear CartPole-v0:

  • 07_1_q_net_cartpole.py
  • 07_2_dqn_2013_cartpole.py
  • 07_3_dqn_2015_cartpole.py

[CartPole-v0 clear condition: average reward >= 195 over 100 games]

Possible fix

  • Currently the code runs 50 update iterations every 10 episodes, which can be a problem: the initial policy is bad, and the network ends up fitting to that bad policy.
if episode % 10 == 1:  # train every 10 episodes
    # Get a random batch of experiences.
    for _ in range(50):
        minibatch = random.sample(replay_buffer, 10)
        loss, _ = ddqn_replay_train(mainDQN, targetDQN, minibatch)
  • Note that DQN is actually an online algorithm: in the original paper (Mnih et al., 2013), the network is updated once per environment step (see Algorithm 1 in the paper).
  • Add a target network as suggested in the follow-up Nature paper (Mnih et al., 2015), though CartPole should also be clearable without one. A per-step update schedule is sketched below.
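A possible shape for the fix, using the same helper names that appear in the snippet above (replay_buffer, ddqn_replay_train, mainDQN, targetDQN); this is only a sketch of a per-step update schedule, not the repo's actual code, and step_count, sess, and copy_ops are assumed to exist.

# Inside the per-step loop: train once per environment step instead of
# running 50 updates every 10 episodes.
BATCH_SIZE = 64
TARGET_UPDATE_FREQ = 100  # steps between target-network syncs

if len(replay_buffer) >= BATCH_SIZE:
    minibatch = random.sample(replay_buffer, BATCH_SIZE)
    loss, _ = ddqn_replay_train(mainDQN, targetDQN, minibatch)

if step_count % TARGET_UPDATE_FREQ == 0:
    sess.run(copy_ops)  # copy main-network weights into the target network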

Need to Improve Discounted Reward

Issue

I noticed that the current discounted-reward function does not sum up the future rewards.
I'm not sure whether this is intended, but even if it is, the policy gradient will not behave as intended because it will focus on learning only the very first action of each episode.

Recall that the discounted reward function is

G_t = \sum_{k=0}^{T-t} \gamma^{k} r_{t+k}

Example

  • r = [r0, r1, r2] = [1, 1, 1]
  • gamma = 0.99
  • expected discounted r = [1 + 0.99 + 0.99 ** 2, 1 + 0.99, 1]
  • current discounted r implementation = [1, 0.99, 0.99**2]

Implementation in this repo

def discount_rewards(r, gamma=0.99):
    """Takes 1d float array of rewards and computes discounted reward
    e.g. f([1, 1, 1], 0.99) -> [1, 0.99, 0.9801] -> [1.22 -0.004 -1.22]
    """
    d_rewards = np.array([val * (gamma ** i) for i, val in enumerate(r)])

    # Normalize/standardize rewards
    d_rewards -= d_rewards.mean()
    d_rewards /= d_rewards.std()
    return d_rewards

Correct Implementation (from Karpathy's code)

def discount_correct_rewards(r, gamma=0.99):
  """ take 1D float array of rewards and compute discounted reward """
  discounted_r = np.zeros_like(r)
  running_add = 0
  for t in reversed(range(0, r.size)):
    #if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (pong specific!)
    running_add = running_add * gamma + r[t]
    discounted_r[t] = running_add

  discounted_r -= discounted_r.mean()
  discounted_r /= discounted_r.std()
  return discounted_r

With LaTeX

Therefore, the above function should change so that it computes

G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k

for each step t, rather than \gamma^t r_t.

Question about 10_1_Actor_Critic.ipynb

In create_op() of the ActorCriticNetwork class, the loss is composed of policy_gain, value_loss, and entropy.
Could you explain what theory these terms are based on, which reference you used when writing them, and how each value is calculated?
(In particular, I don't quite understand why each term's formula is the way it is.)

Thank you.
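For reference, these three terms correspond to the standard advantage actor-critic (A2C/A3C) objective (Mnih et al., 2016). Below is a minimal TF1-style sketch of how such a loss is commonly composed; the tensor names policy, value, action_onehot, and td_target, as well as the coefficients, are assumptions and not copied from the notebook.

# Sketch of a typical A2C loss; `policy` is the actor's softmax output and
# `value` the critic's state-value estimate (both assumed tensors).
log_pi = tf.log(tf.reduce_sum(policy * action_onehot, axis=1) + 1e-8)
advantage = tf.stop_gradient(td_target - value)  # critic is not updated through this term

policy_gain = tf.reduce_mean(log_pi * advantage)                      # policy-gradient term
value_loss = tf.reduce_mean(tf.squared_difference(value, td_target))  # critic regression loss
entropy = -tf.reduce_mean(tf.reduce_sum(policy * tf.log(policy + 1e-8), axis=1))  # exploration bonus

loss = -policy_gain + 0.5 * value_loss - 0.01 * entropy  # typical coefficients, assumed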

Question about the image reshape in 08_4_softmax_pg_pong.py

08_4_softmax_pg_pong.py contains the following code:

X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
x_image = tf.reshape(X, [-1, 80, 80, 4])

The array fed to this placeholder is built by stacking four (80x80) images into shape (4, 6400) and then flattening it into shape (25600,).
Reshaping that into (80, 80, 4) scrambles the images and destroys their spatial structure.

X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
x_image = tf.transpose(tf.reshape(X, [-1, 4, 80, 80]),[0,2,3,1])

Shouldn't it be done like this?
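A quick NumPy check of the difference (frame count and sizes taken from the snippet above; the array contents are random and only illustrative):

import numpy as np

# Four 80x80 frames stacked to (4, 6400) and then flattened, as described above.
frames = np.random.rand(4, 80, 80)
flat = frames.reshape(4, 6400).ravel()               # shape (25600,)

wrong = flat.reshape(80, 80, 4)                      # mixes pixels across frames
right = flat.reshape(4, 80, 80).transpose(1, 2, 0)   # frames intact, channels last

print(np.allclose(right[..., 0], frames[0]))         # True
print(np.allclose(wrong[..., 0], frames[0]))         # False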

08_4_softmax_pg_pong_y.py ---> model restore BUG

agent = Agent(new_HW + [repeat], output_dim=action_dim, logdir='logdir/train',
                      checkpoint_dir="checkpoints")
init = tf.global_variables_initializer()
sess.run(init)

The Agent constructor restores the model, but the initializer below it then runs and overwrites the restored variables.
The order should be reversed:

init = tf.global_variables_initializer()
sess.run(init)
agent = Agent(new_HW + [repeat], output_dim=action_dim, logdir='logdir/train',
                      checkpoint_dir="checkpoints")

Bug?

In 10_1_Actor_Critic.ipynb, the code reads

policy_gain = tf.reduce_sum(policy_gain, name="policy_gain")

It seems this should be changed as follows, so that the magnitude of the policy gain does not grow with the number of steps in the batch:

policy_gain = tf.reduce_mean(policy_gain, name="policy_gain")
