matthewja / inverse-reinforcement-learning
Implementations of selected inverse reinforcement learning algorithms.
License: MIT License
The GridWorld and ObjectWorld environments are both tabular: their states are discrete and finite, so we can write down the feature matrix simply by enumerating all possible states.
However, for more complicated non-tabular environments (such as a Super Mario game), it is impossible to represent the feature matrix by explicitly listing all possible states, since the states are continuous (e.g. any frame of the game at time t) and therefore infinite.
So how can inverse reinforcement learning be applied to a non-tabular environment like Super Mario? Does anyone have any ideas about this?
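One common approach, sketched here purely as an illustration (not something in this repo), is to replace the tabular feature matrix with a feature *function* that maps any continuous state to a fixed-length vector, e.g. radial basis functions or a learned network; MaxEnt-style IRL then fits reward weights on those features instead of one weight per state:

```python
import numpy as np

def rbf_features(state, centers, width):
    """Map a continuous state to a fixed-length feature vector via
    Gaussian radial basis functions centered on `centers`."""
    d = np.linalg.norm(centers - state, axis=1)
    return np.exp(-(d / width) ** 2)

# Four hypothetical centers in a 2-D continuous state space.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
phi = rbf_features(np.array([0.5, 0.5]), centers, width=0.5)
```

The reward is then modeled as a function of `phi` (linear or deep), so it is defined for every state without ever enumerating the state space.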
In the backward pass of MaxEnt (Algorithm 9.1 of Brian Ziebart's thesis), the soft value function V is updated with a softmax over actions, but maxent.py seems to call value_iteration.optimal_value, which computes the hard value function, i.e. it uses max instead of softmax. This looks like a bug.
The initialization also seems odd: at least for the gridworld setting, only the terminal state should be initialized to 0 while all others should be -infinity, but value_iteration.optimal_value seems to initialize everything to 0. Is there a reason for this discrepancy?
Code for reference: https://github.com/MatthewJA/Inverse-Reinforcement-Learning/blob/master/irl/value_iteration.py#L63
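For reference, the soft (log-sum-exp) backup from Algorithm 9.1 can be sketched in a few lines of numpy. This is a toy illustration under my reading of the thesis, not the repository's code:

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(P, r, terminal, n_iter=50):
    """Backward pass of MaxEnt IRL: V(s) = softmax_a Q(s, a),
    with the terminal state pinned to 0 and all others starting at -inf."""
    n_states, n_actions, _ = P.shape
    V = np.full(n_states, -1e10)           # stand-in for -infinity
    V[terminal] = 0.0
    for _ in range(n_iter):
        Q = r[:, None] + np.einsum('sat,t->sa', P, V)
        V = logsumexp(Q, axis=1)           # softmax backup, not max
        V[terminal] = 0.0                  # keep the terminal state fixed
    return V

# Toy 2-state MDP: both actions in state 0 lead to terminal state 1.
P = np.zeros((2, 2, 2))
P[0, :, 1] = 1.0
P[1, :, 1] = 1.0
V = soft_value_iteration(P, np.zeros(2), terminal=1)
```

Here V[0] comes out as log(2): the softmax counts both equally good actions, whereas the hard max would give 0, which illustrates the difference the issue is pointing at.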
I am having a bit of trouble with this part of the code: φ = T.nnet.sigmoid(th.compile.ops.Rebroadcast((0, False), (1, True))(b) + W.dot(φs[-1])).
Traceback (most recent call last):
File "C:\Users\Sankalp Chauhan\AppData\Local\Programs\Python\Python37\lib\site-packages\theano\gof\vm.py", line 301, in call
thunk()
File "C:\Users\Sankalp Chauhan\AppData\Local\Programs\Python\Python37\lib\site-packages\theano\gof\op.py", line 892, in rval
r = p(n, [x[0] for x in i], o)
File "C:\Users\Sankalp Chauhan\AppData\Local\Programs\Python\Python37\lib\site-packages\theano\tensor\blas.py", line 1552, in perform
z[0] = np.asarray(np.dot(x, y))
ValueError: ('shapes (3,10) and (2,41) not aligned: 10 (dim 1) != 2 (dim 0)', (3, 10), (2, 41))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "ID_grouping.py", line 363, in
learning_rate,initialisation="normal",l1=0.1,l2=0.1))
File "ID_grouping.py", line 355, in irl
reward = train(reshaped_to_2d)
File "C:\Users\Sankalp Chauhan\AppData\Local\Programs\Python\Python37\lib\site-packages\theano\compile\function_module.py", line 903, in call
self.fn() if output_subset is None else
File "C:\Users\Sankalp Chauhan\AppData\Local\Programs\Python\Python37\lib\site-packages\theano\gof\vm.py", line 305, in call
link.raise_with_op(node, thunk)
File "C:\Users\Sankalp Chauhan\AppData\Local\Programs\Python\Python37\lib\site-packages\theano\gof\link.py", line 325, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "C:\Users\Sankalp Chauhan\AppData\Local\Programs\Python\Python37\lib\site-packages\six.py", line 702, in reraise
raise value.with_traceback(tb)
File "C:\Users\Sankalp Chauhan\AppData\Local\Programs\Python\Python37\lib\site-packages\theano\gof\vm.py", line 301, in call
thunk()
File "C:\Users\Sankalp Chauhan\AppData\Local\Programs\Python\Python37\lib\site-packages\theano\gof\op.py", line 892, in rval
r = p(n, [x[0] for x in i], o)
File "C:\Users\Sankalp Chauhan\AppData\Local\Programs\Python\Python37\lib\site-packages\theano\tensor\blas.py", line 1552, in perform
z[0] = np.asarray(np.dot(x, y))
ValueError: ('shapes (3,10) and (2,41) not aligned: 10 (dim 1) != 2 (dim 0)', (3, 10), (2, 41))
Apply node that caused the error: Dot22(W, x.T)
Toposort index: 28
Inputs types: [TensorType(float64, matrix), TensorType(float64, matrix)]
Inputs shapes: [(3, 10), (2, 41)]
Inputs strides: [(80, 8), (8, 16)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Elemwise{Composite{scalar_sigmoid((i0 + i1))}}[(0, 1)](Rebroadcast{?,1}.0, Dot22.0)]]
Backtrace when the node is created(use Theano flag traceback.limit=N to make it longer):
File "ID_grouping.py", line 363, in
learning_rate,initialisation="normal",l1=0.1,l2=0.1))
File "ID_grouping.py", line 307, in irl
φ = T.nnet.sigmoid(th.compile.ops.Rebroadcast((0, False), (1, True))(b) + W.dot(φs[-1]))
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
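For what it's worth, the shapes in the traceback can be decoded: Dot22(W, x.T) receives W with shape (3, 10), a hidden layer of 3 units expecting 10 input features, and x.T with shape (2, 41), i.e. a feature matrix of 41 states with only 2 features each. So the first entry of the network structure does not match the feature dimension of the matrix being fed in. A minimal numpy sketch of shapes that would line up (the names here are hypothetical, not the repo's):

```python
import numpy as np

n_states, n_features = 41, 2          # from the traceback: x.T has shape (2, 41)
feature_matrix = np.random.rand(n_states, n_features)

# A first hidden layer with 3 units: W must have n_features columns
# (not 10) for W.dot(x.T) to be defined.
W = np.random.rand(3, n_features)
hidden = W.dot(feature_matrix.T)      # shape (3, 41): one column per state
```

In other words, the layer-size argument that produces the (3, 10) weight matrix presumably needs its input dimension set to the actual number of features per state.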
Hello, thanks for sharing your code. I'm running examples/lp_gridworld.py
and seeing this reward estimate, which looks good:
However, when I change the body of gridworld.reward
to e.g.:
def reward(self, state_int):
    if state_int == 2:  # Goal state now in bottom right of 3x3, not top right
        return 1
    return 0
... then I see this reward estimate:
i.e. linear_irl.irl seems to assume that the goal state is in the top right. Have I got something wrong? How can I get linear IRL to work with different goal states? Thanks.
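For what it's worth, Ng & Russell's LP formulation recovers a reward consistent with a *given optimal policy*, so if the policy passed to linear_irl.irl still points at the old goal, the recovered reward will still peak there. A toy sketch (not the repo's API) of recomputing the greedy policy under the modified reward via value iteration before running IRL:

```python
import numpy as np

def optimal_policy(P, r, gamma=0.9, n_iter=200):
    """Greedy policy from value iteration; P is (S, A, S), r is (S,)."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_iter):
        Q = np.einsum('sat,t->sa', P, r + gamma * V)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Toy 2-state chain: action 0 moves from state 0 to the rewarding state 1.
P = np.zeros((2, 2, 2))
P[0, 0, 1] = 1.0   # action 0: go to state 1
P[0, 1, 0] = 1.0   # action 1: stay
P[1, :, 1] = 1.0   # state 1 absorbs
policy = optimal_policy(P, np.array([0.0, 1.0]))
```

Here `policy[0]` is action 0, i.e. the policy now heads for the new goal; feeding a policy derived this way into the LP should make the recovered reward follow the modified goal state.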
When I test the code, there is an error involving the super() function.
Is the code written for Python 3?
gw = gridworld.Gridworld(5, .3, .2)
gw.transition_probability[7,0,:].reshape(5,5)
outputs
array([
[ 0. , 0. , 0.075, 0. , 0. ],
[ 0. , 0.075, 0.075, 0.775, 0. ],
[ 0. , 0. , 0.075, 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ]])
But shouldn't this sum to 1?
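Yes, transition_probability[s, a, :] is the distribution p(s' | s, a), so each such slice should sum to 1; the values printed above sum to 1.075, which suggests either a transcription slip or a bug worth checking. A quick check one can run on any transition tensor (toy data here):

```python
import numpy as np

def rows_are_stochastic(P, tol=1e-8):
    """Check that P[s, a, :] sums to 1 for every state-action pair."""
    return np.allclose(P.sum(axis=-1), 1.0, atol=tol)

# Toy 2-state, 2-action transition tensor.
P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.0, 1.0], [1.0, 0.0]]])
assert rows_are_stochastic(P)
```

Running the same check on the repo's gw.transition_probability would settle whether the printout or the tensor is at fault.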
Hello! I am a graduate student from China. In your demo the feature_matrix is not provided, so given a state space, how do I obtain the feature_matrix? I would be grateful if you could answer.
Apart from that, I have some questions about the details of inverse reinforcement learning. If you could provide your contact information, I would really appreciate it.
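If it helps, for these tabular demos the simplest feature matrix is the identity: each state gets a one-hot feature vector, so the learned weight vector is the per-state reward itself. A minimal sketch (my assumption about the intended usage, not code from the repo):

```python
import numpy as np

n_states = 25                       # e.g. a 5x5 gridworld
feature_matrix = np.eye(n_states)   # one-hot ("ident") feature per state
```

Any (n_states, n_features) matrix works in principle; the identity is just the fully expressive default.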
Hi Matthew!
This repo is just great: it works, and it's transparent and modular!
I only found two differences between Ziebart's thesis and your implementation.
Can you let me know if you were aware of them?
Here is your code:
And here is Eq 9.1:
Which uses
And here is your code:
You include a discount factor in Eq 9.2, and in 9.1 you convert a subtraction (
I am currently running the code and getting an error.
I reshaped the feature matrix, reshaped_to_2d_reshape, to (num_of_states, num_of_dimensions) and am still getting this error.
I am not sure how to debug it.
Traceback (most recent call last):
File "ID_grouping.py", line 409, in
learning_rate,initialisation="normal",l1=0.1,l2=0.1))
File "ID_grouping.py", line 398, in irl
reward = train(reshaped_to_2d_reshape[0])
File "C:\Users\Sankalp Chauhan\AppData\Local\Programs\Python\Python37\lib\site-packages\theano\compile\function_module.py", line 813, in call
allow_downcast=s.allow_downcast)
File "C:\Users\Sankalp Chauhan\AppData\Local\Programs\Python\Python37\lib\site-packages\theano\tensor\type.py", line 178, in filter
data.shape))
TypeError: Bad input argument to theano function with name "ID_grouping.py:388" at index 0 (0-based).
Backtrace when that variable is created:
File "ID_grouping.py", line 409, in
learning_rate,initialisation="normal",l1=0.1,l2=0.1))
File "ID_grouping.py", line 316, in irl
s_feature_matrix = T.matrix("x")
Wrong number of dimensions: expected 2, got 1 with shape (5,).
Which versions of Theano and theano.tensor do you use in your code?
This occurs with deep_maxent.py
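For what it's worth, T.matrix("x") declares a strictly 2-D input, while reshaped_to_2d_reshape[0] selects a single row, i.e. a 1-D vector of shape (5,), which is exactly what the error reports. A hypothetical fix is to keep the input 2-D:

```python
import numpy as np

x = np.arange(5, dtype=float)   # shape (5,): what the error reports
x2d = x.reshape(1, -1)          # shape (1, 5): a proper 2-D matrix
```

Equivalently, pass the whole (num_of_states, num_of_dimensions) matrix rather than one indexed row.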
Could you list some references that would help me understand how you formulated the block-matrix form of the linear program for solving the large-state-space problem from Ng's paper? In particular, I am missing how you treat the function p(x) = x if x > 0, 2x if x < 0, which is part of the objective function.
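If it helps, the standard LP trick for a concave piecewise-linear term like p(x) = min(x, 2x) is to introduce an auxiliary variable t with constraints t <= x and t <= 2x and maximize t; at the optimum, t equals min(x, 2x). A toy illustration with scipy (the repo uses cvxopt, but the linearization is the same):

```python
from scipy.optimize import linprog

# Maximize p(x) = min(x, 2x) for x in [-1, 1] via variables (x, t):
#   maximize t   subject to   t <= x,  t <= 2x.
c = [0.0, -1.0]                    # linprog minimizes, so minimize -t
A_ub = [[-1.0, 1.0],               # t - x  <= 0
        [-2.0, 1.0]]               # t - 2x <= 0
b_ub = [0.0, 0.0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(-1, 1), (None, None)])
```

The optimum lands at x = 1, t = min(1, 2) = 1. Stacking one such (t, constraints) pair per state is presumably what produces the block-matrix form in the code.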
I need help to try to remove this warning:
WARNING (theano.configdefaults): g++ not available, if using conda: conda install m2w64-toolchain
C:\Users\Sankalp Chauhan\AppData\Local\Programs\Python\Python37\lib\site-packages\theano\configdefaults.py:560: UserWarning: DeprecationWarning: there is no c++ compiler. This is deprecated and with Theano 0.11 a c++ compiler will be mandatory
warnings.warn("DeprecationWarning: there is no c++ compiler."
WARNING (theano.configdefaults): g++ not detected! Theano will be unable to execute optimized C-implementations (for both CPU and GPU) and will default to Python implementations. Performance will be severely degraded. To remove this warning, set Theano flag cxx to an empty string.
My code is working, but it takes a long time to produce output.
Can you give some pointers on how you designed the feature matrix? You kept it as 25x25 (in the case of Gridworld), where each state is represented separately. My understanding is that states should be grouped according to their characteristics (e.g. goal states, ground states, puddle states). Why did you characterize each state separately?
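For comparison, here is a hypothetical grouped-feature matrix of the kind described above, as an alternative to the identity choice: states sharing a class share a reward weight, which trades expressiveness for generalization (the class assignments below are made up):

```python
import numpy as np

n_states, n_classes = 25, 3
state_class = np.zeros(n_states, dtype=int)   # class 0: ordinary ground
state_class[[6, 7, 8]] = 1                    # class 1: hypothetical "puddle" states
state_class[24] = 2                           # class 2: the goal
grouped_features = np.eye(n_classes)[state_class]   # shape (25, 3)
```

With the 25x25 identity, IRL can assign every state its own reward; with the grouped (25, 3) matrix, it only learns three weights but generalizes across states of the same type.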
Hello,
The report link in README.md is broken; it would be great for me and future readers if it were updated so we can read how the algorithm works.
Thanks.
I was wondering why irl/maxent.py line 71 unpacks three things from a 2-D array. I think this is a small error?
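In the repo, transition_probability is (as far as I can tell) a 3-D (n_states, n_actions, n_states) array, so unpacking three values from its .shape works; a ValueError would only appear if a 2-D array reached that line. A quick illustration:

```python
import numpy as np

P = np.zeros((5, 4, 5))                    # (n_states, n_actions, n_states)
n_states, n_actions, _ = P.shape           # fine: three values from a 3-D shape

try:
    a, b, c = np.zeros((5, 4)).shape       # a 2-D shape has only two values
except ValueError:
    unpack_failed = True
```

So if you are hitting this, it may be worth checking the dimensionality of the array you are passing in.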
File "../irl/mdp/objectworld.py", line 56, in init
super().init(grid_size, wind, discount)
TypeError: super() takes at least 1 argument (0 given)
Could anyone give any help?
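For reference, zero-argument super() only exists in Python 3; under Python 2 it raises exactly this TypeError, so the repo is presumably being run on the wrong interpreter. A sketch of the backwards-compatible form, using simplified stand-in classes rather than the repo's real ones:

```python
class Gridworld(object):
    def __init__(self, grid_size, wind, discount):
        self.grid_size = grid_size
        self.wind = wind
        self.discount = discount

class Objectworld(Gridworld):
    def __init__(self, grid_size, wind, discount):
        # Zero-argument super() is Python-3-only; the explicit form
        # below works on both Python 2 and Python 3.
        super(Objectworld, self).__init__(grid_size, wind, discount)

ow = Objectworld(5, 0.3, 0.2)
```

Either run the code under Python 3 or rewrite the super() calls in this explicit style.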