infinite-horizon-off-policy-estimation's Issues

Discrete G

I'm trying to get the discrete (discounted) case to work for a toy MDP, but it doesn't seem to give sensible results. I may be doing something wrong.

Could you explain what G, Nstate, Ghat are and how they relate to \Delta and k(s,s') in the paper?

Negative loss

Hi! I have a question about how loss is defined.

In the paper, the loss takes the form D(w) = L^2 = E[ d(w,s,a,s') d(w1,s1,a1,s1') k(s,s') ]. In other words, it has the form E[x^T K y] for x = d(w,s,a,s') and y = d(w1,s1,a1,s1'). This means that x^T K y is always positive (since E[x^T K y] = L^2 > 0). However, when running the SUMO code, I'm seeing negative values for loss_xx, which confuses me. Is this a bug, or is a negative loss allowed?

[Screenshot, 2019-03-08: loss_xx and self.loss for a few epochs of training]

The screenshot shows loss_xx and self.loss for a few epochs of training. Notice that the loss is negative in some cases.
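A minimal sketch of one way a negative value could arise, under assumptions that are not confirmed against the SUMO code: the names `rbf_kernel`, `states`, `x` and `y` are hypothetical, and the sketch assumes the loss is a minibatch estimate of E[x^T K y]. With the same batch on both sides, x^T K x is a quadratic form in a positive semi-definite kernel matrix and cannot be negative. With two different batches, only the expectation is non-negative, so a single estimate can take either sign. The example uses the extreme case y = -x to make that visible.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix k(a_i, b_j) between two sets of states."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(0)
n = 16
states = rng.normal(size=(n, 2))  # hypothetical states fed to the kernel
x = rng.normal(size=n)            # hypothetical residuals d(w, s, a, s')

K = rbf_kernel(states, states)

# Same batch on both sides: quadratic form in a PSD matrix, never negative.
same_batch = x @ K @ x / n**2

# Two different batches, extreme case: same states, opposite residuals.
y = -x
cross_batch = x @ K @ y / n**2

print("same batch :", same_batch)
print("cross batch:", cross_batch)
```

If the training code pairs two independently drawn batches, occasional negative minibatch values would be consistent with this picture; if it uses a single batch on both sides, a negative value would point to something else.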

Generating SARS data

Hello,
I am new to SUMO.
When I run sumo/collect_data.py, I get

Error: Invalid vehicle id '0_NE_(1,1)'. Contains invalid characters.

I then deleted all the files in the ./data directory and generated a new grids.net_0.xml file using
netconvert -n data/grids.nod.xml -e data/grids.edg.xml -i data/grids.tlLogic.xml -o data/grids.net_0.xml
which fails with

Error: value '(0,1)' does not match regular expression facet '[^ \t\n\r|\\;,']+'

I wonder whether the function generate_nodes produces a valid grids.nod.xml file.
How should I handle this so that the SARS data is generated successfully?
Best
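A small sketch for checking ids against the character class quoted in the netconvert error above. The pattern is copied from that message; the helper names `is_valid_id` and `sanitize_id` are hypothetical and not part of SUMO or of this repository. Replacing offending characters with an underscore is only one possible workaround, and any renamed id would also have to be changed consistently wherever the data files and scripts refer to it.

```python
import re

# Character class copied from the error message:
# an id may not contain space, tab, newline, CR, '|', '\', ';', ',' or "'".
VALID_ID = re.compile(r"[^ \t\n\r|\\;,']+")
FORBIDDEN = re.compile(r"[ \t\n\r|\\;,']")

def is_valid_id(identifier):
    """Return True if the whole id matches the facet from the error message."""
    return VALID_ID.fullmatch(identifier) is not None

def sanitize_id(identifier, replacement="_"):
    """Replace every forbidden character (hypothetical workaround)."""
    return FORBIDDEN.sub(replacement, identifier)

for candidate in ["(0,1)", "0_NE_(1,1)"]:
    print(candidate, is_valid_id(candidate), sanitize_id(candidate))
```

Both ids from the error messages contain a comma, which is in the forbidden set, so that is the likely cause of the rejection.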

Question about discrete case

Hi!

In the density_ratio_estimate function, you have "x = quadratic_solver(n, G/50., regularizer)". Where does the 50. come from?

Confusion about Algorithm 1 output

Hi!

I'm slightly confused about the output of Algorithm 1. It says that the off-policy estimate for Pi1 is given by v^T r / sum(v). Consider the case where the reward r is identically -1 for every transition (s,a,r,s'). Then the OPE estimate evaluates to -sum(v)/sum(v) = -1, which is the correct average reward.
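The arithmetic above can be checked with a short sketch. The function name `weighted_average_reward` and the sample weights are made up for illustration; only the formula v^T r / sum(v) comes from the algorithm's stated output.

```python
import numpy as np

def weighted_average_reward(v, r):
    """Self-normalized estimate v^T r / sum(v)."""
    v = np.asarray(v, dtype=float)
    r = np.asarray(r, dtype=float)
    return float(v @ r / v.sum())

v = np.array([0.2, 1.5, 0.7, 3.1])  # arbitrary positive weights
r = -np.ones_like(v)                # reward is -1 on every transition

print(weighted_average_reward(v, r))
```

Because the estimate is normalized by sum(v), a constant reward is reproduced exactly whatever the weights are, and the number of transitions in the data never enters the result.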

However, if we're considering a (long) variable-length finite horizon with gamma=1, then this algorithm will clearly not work. Would you suggest using the second algorithm instead, with gamma very close to 1?

Thanks!!!
