
Comments (9)

clvoloshin commented on August 20, 2024

In the case that I should use the second algorithm, what is f(s_0) in the RKHS case?

zt95 commented on August 20, 2024

Thank you again for your interest.

1. For the averaging case, our algorithm can still work, though it is not theoretically sound. In practice (as in the results shown in the experimental part) it can still estimate the average reward.

2. For the discounted case, you can slightly change the loss function without changing the whole framework: feed in a placeholder called 'isStart' indicating whether the transition pair is the initial transition of your trajectory, then redefine the \delta in the paper (x in the code) as

x = (1-self.isStart) * w * self.policy_ratio + self.isStart * norm_w - w_next

and sample each transition (s_i, a_i, s_i') with probability proportional to \gamma^i, where \gamma is the discount factor, when feeding in the data.

I may upload the discounted part when I have time, but you can definitely make this small revision yourself without much effort.
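A minimal NumPy sketch of those two changes (function and variable names here are hypothetical, not the repo's; the actual code feeds isStart through a TensorFlow placeholder):

```python
import numpy as np

def sample_discounted(transitions, gamma, n_samples, seed=0):
    """Sample transitions (s_i, a_i, s_i') with probability proportional to gamma**i."""
    rng = np.random.default_rng(seed)
    probs = gamma ** np.arange(len(transitions), dtype=float)
    probs /= probs.sum()
    idx = rng.choice(len(transitions), size=n_samples, p=probs)
    return [transitions[i] for i in idx]

def modified_x(w, policy_ratio, w_next, norm_w, is_start):
    """The redefined delta/x; is_start is 1 for a trajectory's first transition, else 0."""
    return (1 - is_start) * w * policy_ratio + is_start * norm_w - w_next
```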

clvoloshin commented on August 20, 2024

Great! Awesome.

What is a quick justification for x = (1-self.isStart) * w * self.policy_ratio + self.isStart * norm_w - w_next? Maybe it's obvious but I don't quite see it.

zt95 commented on August 20, 2024

In equation (15) of the paper, with probability (1-\gamma) we let \delta = 1 - w_next, and since we need to normalize, the 1 becomes norm_w instead. (I think there may be a typo in Algorithm 2, where it should read (1 - w_0) f_0.)
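Putting the two comments together, the mixture looks like this (a sketch only; \beta denotes the policy ratio and \bar{w} the normalizer norm_w, notation assumed rather than taken from the paper):

```latex
\delta =
\begin{cases}
  w(s)\,\beta(a \mid s) - w(s') & \text{with probability } \gamma \quad \text{(regular transition)}\\
  \bar{w} - w(s') & \text{with probability } 1-\gamma \quad \text{(initial transition; the 1 becomes } \bar{w}\text{)}
\end{cases}
```

which is exactly what the isStart switch in the x expression above implements.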

zt95 commented on August 20, 2024

and self.isStart is a 0/1 boolean flag.

clvoloshin commented on August 20, 2024

Ah, I see. Will the discounting still work even if gamma = 1? In other words, what if I care about the (non-discounted) sum of rewards over a finite horizon?

clvoloshin commented on August 20, 2024

With this new definition of \delta (or x), I'm still getting the average reward. The reason is, as I stated earlier, that if r is identically -1, then the output is -\sum(v) / \sum(v) = -1, the average reward (regardless of discounting). So I'd have to change this somehow.

Suppose the trajectory length is (on average) 100 for PI0, but (on average) 150 for PI1. Then what I'd like is OPE(PI1) ~= -150 (assuming -1 reward at every step). However, -\sum(v)/\sum(v) = -1. Hm....

Should the output be v^T r / N, where N is the number of trajectories?
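A toy illustration of the collapse being described (all numbers made up):

```python
import numpy as np

v = np.array([0.5, 1.2, 0.8, 1.5])  # made-up importance weights
r = -np.ones_like(v)                # reward identically -1 at every step
N = 2                               # made-up number of trajectories

normalized = (v @ r) / v.sum()      # -sum(v)/sum(v) = -1.0, the average reward
per_traj   = (v @ r) / N            # proposed v^T r / N = -2.0, a total-return scale
print(normalized, per_traj)
```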

zt95 commented on August 20, 2024

I'm not sure I understand your question correctly. So you are proposing an empirical estimator where we divide by N, the number of trajectories.

What we are doing is dividing by the sum of all the importance weights, which leads to a normalized style of estimator. In both practice and theory it leads to much lower variance.
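For reference, the two estimators in standard form (a sketch; w_i are the importance weights, r_i the rewards, n the number of samples): ordinary importance sampling divides by n, while the weighted (self-normalized) version divides by the sum of the weights, trading a small bias for much lower variance.

```latex
\hat{R}_{\mathrm{IS}} = \frac{1}{n}\sum_{i=1}^{n} w_i\, r_i,
\qquad
\hat{R}_{\mathrm{WIS}} = \frac{\sum_{i=1}^{n} w_i\, r_i}{\sum_{i=1}^{n} w_i}.
```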

zt95 commented on August 20, 2024

For your reference:

https://papers.nips.cc/paper/5249-weighted-importance-sampling-for-off-policy-learning-with-linear-function-approximation.pdf
