zt95 / infinite-horizon-off-policy-estimation

Python 100.00%

infinite-horizon-off-policy-estimation's Issues

Discrete G

I'm trying to get the discrete (discounted) case to work for a toy MDP, but it doesn't seem to be giving sensible results. I think I may be doing something wrong.

Could you explain what G, Nstate, Ghat are and how they relate to \Delta and k(s,s') in the paper?

Negative loss

Hi! I have a question about how loss is defined.

In the paper, the loss takes the form D(w) = L^2 = E[ d(w,s,a,s') d(w1,s1,a1,s1') k(s,s') ]. In other words, it has the form E[x^T K y] with x = d(w,s,a,s') and y = d(w1,s1,a1,s1'). I would therefore expect the loss to be non-negative, since E[x^T K y] = L^2 >= 0. Empirically, however, when running the SUMO code, I'm seeing negative values for loss_xx, which confuses me. Is this a bug, or is a negative loss allowed?

(Screenshot: loss_xx and self.loss for a few epochs of training. Notice that the loss is negative in some cases.)
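One possible way to reason about the sign, sketched under assumptions: if the same batch of residuals appears on both sides of a positive semi-definite kernel matrix, the empirical estimate is a quadratic form and cannot be negative (up to rounding), whereas an estimate built from two different batches is a bilinear form and can take either sign. The Gaussian kernel, the random stand-in data, and all names below are hypothetical; this does not establish how loss_xx is actually computed in the repository.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(0)
n = 64

# Hypothetical stand-ins for states and residuals d(...) from two batches.
s1, x = rng.normal(size=(n, 2)), rng.normal(size=n)
s2, y = rng.normal(size=(n, 2)), rng.normal(size=n)

# Same batch on both sides: a quadratic form in a PSD Gram matrix, so >= 0
# (up to floating-point error).
same_batch = x @ rbf_kernel(s1, s1) @ x / n**2

# Two different batches: a bilinear form whose sign is not constrained.
cross_batch = x @ rbf_kernel(s1, s2) @ y / n**2
# Flipping the sign of y flips the sign of the cross term, so one of the
# two cross-batch values is necessarily <= 0.
cross_batch_flipped = x @ rbf_kernel(s1, s2) @ (-y) / n**2

print(same_batch, cross_batch, cross_batch_flipped)
```

A same-batch estimate can also go negative if the diagonal terms are dropped, so a negative value by itself may not indicate a bug.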

Generating SARS data

Hello,
I am new to SUMO.
When I run sumo/collect_data.py, I get the following error:

Error: Invalid vehicle id '0_NE_(1,1)'. Contains invalid characters.

I then deleted all the files in the ./data directory and generated a new grids.net_0.xml file using
netconvert -n data/grids.nod.xml -e data/grids.edg.xml -i data/grids.tlLogic.xml -o data/grids.net_0.xml
which fails with:

Error: value '(0,1)' does not match regular expression facet '[^ \t\n\r|\\;,']+'

I wonder whether the generate_nodes function produces a valid grids.nod.xml file.
How should I handle this case so that the SARS data is generated successfully?
Best
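A minimal sketch of how such IDs could be checked before writing the XML files, assuming the character class quoted in the netconvert error above is the relevant constraint. The helper names is_valid_id and sanitize_id are hypothetical and are not part of the repository or of SUMO; SUMO may apply additional rules that this check does not cover.

```python
import re

# Character class copied from the netconvert error message quoted above.
ALLOWED_ID = re.compile(r"[^ \t\n\r|\\;,']+")
# The same set of characters, used to find the offending ones.
FORBIDDEN_CHARS = re.compile(r"[ \t\n\r|\\;,']")

def is_valid_id(identifier: str) -> bool:
    """Return True if the whole identifier matches the quoted pattern."""
    return ALLOWED_ID.fullmatch(identifier) is not None

def sanitize_id(identifier: str, replacement: str = "_") -> str:
    """Replace every forbidden character with the replacement string."""
    return FORBIDDEN_CHARS.sub(replacement, identifier)

for raw in ["(0,1)", "0_NE_(1,1)"]:
    # The comma in both IDs is what fails the pattern.
    print(raw, is_valid_id(raw), sanitize_id(raw))
```

If IDs are renamed this way, the same renaming would have to be applied consistently wherever the IDs are referenced (nodes, edges, traffic-light logic, and vehicle IDs).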

Question about discrete case

Hi!

In the density_ratio_estimate function, you have "x = quadratic_solver(n, G/50., regularizer)". Where does the 50. come from?

Confusion about Algorithm 1 output

Hi!

I'm slightly confused about the output of Algorithm 1. It says that the off-policy estimate for Pi1 is given by v^T r / sum(v). Consider the case where the reward r is identically -1 for every transition (s,a,r,s'). Then the OPE estimate evaluates to -sum(v)/sum(v) = -1, which is the correct average reward.

However, if we're considering a (long) variable-length finite horizon with gamma=1, then this algorithm will clearly not work: the estimate is -1 regardless of how long the episodes are. Do you suggest using the second algorithm instead, with gamma very close to 1?

Thanks!!!
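The arithmetic in this question can be checked with a small sketch. The weights v below are made-up numbers, and self_normalized_estimate is a hypothetical helper that only evaluates the formula v^T r / sum(v) quoted above; it is not the repository's implementation of Algorithm 1.

```python
import numpy as np

def self_normalized_estimate(v, r):
    """Evaluate v^T r / sum(v) for weights v and per-transition rewards r."""
    v = np.asarray(v, dtype=float)
    r = np.asarray(r, dtype=float)
    return float(v @ r / v.sum())

# Made-up positive weights for four transitions, each with reward -1.
v = np.array([0.5, 2.0, 1.25, 0.25])
r = -np.ones_like(v)

estimate = self_normalized_estimate(v, r)  # -sum(v) / sum(v)
undiscounted_return = float(r.sum())       # grows with the number of steps

print(estimate, undiscounted_return)
```

The normalized estimate stays at -1 however many transitions are included, while the undiscounted sum of rewards scales with the number of steps, which is the mismatch the question points at.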
