mpatacchiola / dissecting-reinforcement-learning

Python code, PDFs and resources for the series of posts on Reinforcement Learning which I published on my personal blog

Home Page: https://mpatacchiola.github.io/blog/

License: MIT License

Python 100.00%

reinforcement-learning deep-reinforcement-learning markov-chain temporal-differencing-learning sarsa q-learning actor-critic multi-armed-bandit inverted-pendulum mountain-car

dissecting-reinforcement-learning's Issues

Part.1 Modified Policy Iteration with Simplified Bellman Equation and Linear Algebra Policy Evaluation Infinite Loop

Hello,

I am attempting to run the function "main_linalg()" in policy_iteration.py but the program fails to terminate.

The standard policy iteration program, which uses iterative policy evaluation, returns the correct policy.

After some investigation, I found that the program does terminate if you replace

u = return_policy_evaluation_linalg(p, r, T, gamma)

with

u = return_policy_evaluation(p, u, r, T, gamma)

in the function main_linalg.

This turns the implementation into a modified policy iteration algorithm that uses iterative policy evaluation.
With this change the program terminates after 4 to 5 iterations.
However, it returns a policy that differs from the expected one.

I did these changes because my initial thought was that the linear and iterative approaches were supposed to return the same utility values for each state. Do you know if this is truly the case?

I found another GitHub repository, https://github.com/SparkShen02/MDP-with-Value-Iteration-and-Policy-Iteration,
that implements the modified policy iteration algorithm with iterative policy evaluation.

Although your transition matrix generator uses padding to account for boundary collisions, I suspect the linear algebra approach fails to detect collisions with the walls, so the optimal action keeps switching between the correct action and one that causes a wall collision.

I am not sure how to proceed. Please look into this for a possible fix. Thank you.
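On the question of whether the linear and iterative approaches should return the same utilities: for a fixed policy and a discount factor below 1, both solve the same Bellman equation, so they should agree up to numerical tolerance. The sketch below checks this on a small made-up chain. The function names, the 4-state transition matrix, the rewards, and gamma = 0.9 are all assumptions for illustration and are not taken from policy_iteration.py.

```python
import numpy as np

def evaluate_policy_linalg(T_pi, r, gamma):
    """Solve (I - gamma * T_pi) u = r directly (hypothetical helper)."""
    n = T_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * T_pi, r)

def evaluate_policy_iterative(T_pi, r, gamma, tol=1e-10, max_iter=100000):
    """Repeat the Bellman backup u <- r + gamma * T_pi u until it stops changing."""
    u = np.zeros_like(r)
    for _ in range(max_iter):
        u_new = r + gamma * T_pi @ u
        if np.max(np.abs(u_new - u)) < tol:
            return u_new
        u = u_new
    return u

# Made-up 4-state chain under a fixed policy; each row sums to 1.
rng = np.random.default_rng(0)
T_pi = rng.random((4, 4))
T_pi /= T_pi.sum(axis=1, keepdims=True)
r = np.array([-0.04, -0.04, 1.0, -1.0])
gamma = 0.9

u_linalg = evaluate_policy_linalg(T_pi, r, gamma)
u_iter = evaluate_policy_iterative(T_pi, r, gamma)
print(np.max(np.abs(u_linalg - u_iter)))
```

If the two evaluations agree on the gridworld as well, the non-termination would more likely come from the policy improvement step or the stopping condition than from the evaluation itself, but that is a guess that would need checking against the actual script.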

Problem in executing: "Montecarlo_control.py"

Dear Massimiliano,

I am trying to execute your code "Montecarlo_control.py" from post number 2.

I have got the following issue:

It seems that if(checkup_matrix[row, col] == 0): receives a row index that is a float rather than an int.
As a result, the lookup into the table fails.

Luca
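A minimal sketch of the symptom and one possible workaround, assuming the float comes from a true division (in Python 3, `/` always returns a float). The 3x4 shape and the values of `row` and `col` are invented here; where the float actually originates in Montecarlo_control.py is not known from the report.

```python
import numpy as np

checkup_matrix = np.zeros((3, 4))

# Hypothetical source of the problem: true division yields floats in Python 3.
row, col = 2 / 2, 4 / 2   # 1.0 and 2.0

try:
    checkup_matrix[row, col]
    float_index_ok = True
except IndexError:
    # NumPy rejects float indices, even when they hold whole numbers.
    float_index_ok = False

# Possible workaround: cast to int before indexing (or use // upstream).
row, col = int(row), int(col)
is_unvisited = checkup_matrix[row, col] == 0
print(float_index_ok, is_unvisited)
```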

adding optimal policy calculation in the value iteration algorithm

You could add an optimal policy computation after generate_graph in the value iteration algorithm:

https://mpatacchiola.github.io/blog/2016/12/09/dissecting-reinforcement-learning.html

    generate_graph(graph_list)

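    # Optimal policy computation: pick the best action for each state
    pi = np.zeros(tot_states)
    for s in range(tot_states):
        # One-hot vector selecting state s
        v = np.zeros(tot_states)
        v[s] = 1.0
        pi[s] = return_expected_action(v, T, u)
    pi[5] = np.nan        # obstacle
    pi[3] = pi[7] = -1    # terminal states
    print(pi)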

def return_expected_action(u, T, v):
    actions_array = np.zeros(4)
    for action in range(4):
        # Expected utility of doing a in state s, according to T and u.
        actions_array[action] = np.sum(np.multiply(u, np.dot(v, T[:,:,action])))
    return np.argmax(actions_array)

Part 3, TD(lambda): trace_matrix should be reset to zeroes at the beginning of each epoch

I believe that in part 3, TD(lambda), the trace_matrix should be reset to zeros at the beginning of each epoch. Otherwise the utility of a state may be updated even if the state is not part of the current trace.

Also, I believe that the decay of the trace_matrix should be moved to just before the line:
trace_matrix[observation[0], observation[1]] += 1
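The two suggestions above can be sketched as follows. This is a generic TD(lambda) update with accumulating traces, not the code from the post: the function name, the episode format, and the hyperparameter values are assumptions. The trace is created fresh for each episode, and it is decayed just before the visited state is incremented.

```python
import numpy as np

def td_lambda_episode(utility, episode, alpha=0.1, gamma=0.9, lambda_=0.5):
    """Run one episode of TD(lambda) and update `utility` in place.

    `episode` is a list of (state, reward, next_state, done) tuples,
    where each state is a (row, col) tuple (hypothetical format).
    """
    trace_matrix = np.zeros_like(utility)   # reset at the start of each episode
    for state, reward, next_state, done in episode:
        bootstrap = 0.0 if done else gamma * utility[next_state]
        delta = reward + bootstrap - utility[state]
        trace_matrix *= gamma * lambda_      # decay first ...
        trace_matrix[state] += 1.0           # ... then mark the visited state
        utility += alpha * delta * trace_matrix
    return utility

utility = np.zeros((3, 4))
episode = [((0, 0), 0.0, (0, 1), False),
           ((0, 1), 1.0, (0, 2), True)]
td_lambda_episode(utility, episode)
print(utility)
```

Because the trace starts at zero, only states visited in the current episode can receive an update.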

Two undefined variables

In the setPosition function, at line 102, there are two undefined variables (tot_row and tot_col).

about greedy agent in multi-armed bandit

https://github.com/mpatacchiola/dissecting-reinforcement-learning/blob/master/src/6/multi-armed-bandit/greedy_agent_bandit.py#L51

According to the code above, we first find the maximum utility so far and the arm it belongs to, and then call np.random.choice on range(arm).
Why is np.random.choice needed? Shouldn't the greedy agent simply pick the argmax arm?
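One reason a greedy agent might involve a random choice at all is tie-breaking: np.argmax always returns the first maximal index, so when several arms share the best estimate (for example, all zeros at the start) the lowest-numbered arm is favoured. The sketch below shows random tie-breaking among the best arms. It is an illustration under that assumption, with a hypothetical function name, and is not a description of what the linked line does.

```python
import numpy as np

def greedy_arm(utilities, rng):
    """Return a greedy arm, breaking ties uniformly at random (hypothetical helper)."""
    utilities = np.asarray(utilities)
    best_arms = np.flatnonzero(utilities == utilities.max())
    return int(rng.choice(best_arms))

rng = np.random.default_rng(0)
tied = [0.5, 0.2, 0.5]      # arms 0 and 2 are tied
unique = [0.1, 0.9, 0.3]    # arm 1 is the only maximum

tied_picks = {greedy_arm(tied, rng) for _ in range(200)}
unique_pick = greedy_arm(unique, rng)
print(tied_picks, unique_pick)
```

With a unique maximum this behaves exactly like np.argmax.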

Missing brackets

There are missing brackets here (I am creating a pull request):

dissecting-reinforcement-learning/src/7/boolean_worlds_td.py

Line 58 in c25b3a4

[0.0, 0.1, 0.8, 0.1],

mdp linear algebra approach cannot stop

This is an excellent example.
However, when I tried the linear algebra approach from the MDP post, the while loop never terminates.

typo in policy iteration algorithm on the site

https://mpatacchiola.github.io/blog/2016/12/09/dissecting-reinforcement-learning.html

The function should be (as in the source code):

def return_expected_action(p, u, T, v):
    actions_array = np.zeros(4)
    for action in range(4):
        # Expected utility of doing a in state s, according to T and u.
        actions_array[action] = np.sum(np.multiply(u, np.dot(v, T[:,:,action])))
    return np.argmax(actions_array)

Also, the parameter p is not needed:

def return_expected_action(u, T, v):

11X11 grid

Hi @mpatacchiola, I have an 11X11 grid. How can I build the transition_matrix for it?
Can you please help me?
Is there any generic code that creates the transition_matrix from the number of rows and columns of the grid?
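A generic generator could look roughly like the sketch below. It is not the repository's own generator; the function name, the action ordering (0=up, 1=right, 2=down, 3=left), the 0.8/0.1/0.1 slip probabilities, and the rule that bumping into a wall leaves the agent in place are all assumptions. It ignores obstacles and terminal states, which would need extra handling. The result has shape (states, states, actions), matching the T[:,:,action] indexing used elsewhere in these issues.

```python
import numpy as np

# (d_row, d_col) per action; the ordering is an assumption for this sketch.
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # up, right, down, left

def build_transition_matrix(rows, cols, p_forward=0.8, p_side=0.1):
    """Return T with T[s, s_next, a] = P(s_next | s, a) for an open grid."""
    n_states = rows * cols
    T = np.zeros((n_states, n_states, 4))
    for r in range(rows):
        for c in range(cols):
            s = r * cols + c
            for a in range(4):
                # Intended direction plus the two perpendicular slips.
                outcomes = [(a, p_forward),
                            ((a - 1) % 4, p_side),
                            ((a + 1) % 4, p_side)]
                for direction, prob in outcomes:
                    d_row, d_col = MOVES[direction]
                    nr, nc = r + d_row, c + d_col
                    if not (0 <= nr < rows and 0 <= nc < cols):
                        nr, nc = r, c        # hit a wall: stay in place
                    T[s, nr * cols + nc, a] += prob
    return T

T = build_transition_matrix(11, 11)
print(T.shape)
```

Each slice T[s, :, a] is a probability distribution over next states, so it should sum to 1; that is a quick sanity check after any change to the movement rules.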

Print statement causing issue in Python 3.x

dissecting-reinforcement-learning/src/4/gridworld.py

Line 124 in c25b3a4

print graph

At line 124:
print graph should be print(graph)

Alternative to Numpy

I would like to try your code on the pyboard and the OpenMV boards. Unfortunately, NumPy is too large to be installed on a microcontroller. Would it be possible to use a list, a bytearray, or an array.array to implement the functions you are using from NumPy?
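For the small matrices in these examples, the NumPy calls mostly reduce to dot products and an argmax, which plain lists can express. The sketch below shows a list-based version of the expected-action computation. The helper names, the nested-list layout T[action][state][next_state], and the toy numbers are assumptions for illustration, and the code has not been tried on a pyboard or OpenMV board.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def expected_action(u, T, state):
    """Return the action with the highest expected utility from `state`.

    u is a list of state utilities; T[action][state] is a list of
    next-state probabilities (hypothetical layout for this sketch).
    """
    best_action, best_value = 0, None
    for action in range(len(T)):
        value = dot(T[action][state], u)
        if best_value is None or value > best_value:
            best_action, best_value = action, value
    return best_action

# Toy problem: 3 states, 2 actions.
u = [0.0, 1.0, -1.0]
T = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],   # action 0: stay
    [[0.1, 0.8, 0.1], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],   # action 1: move on
]
best = expected_action(u, T, 0)
print(best)
```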

The clean robot example in chapter 1?

Hello, I really don't understand this example in chapter one:

Why does the robot, starting at state (1,1) and taking a single action (up, down, left, or right), have three possible subsequent states?
Thank you.

Looking forward to post #8

Thanks for posting the series.