
Comments (9)

clvoloshin commented on August 20, 2024

In the case that I should use the second algorithm, what is f(s_0) in the RKHS case?

zt95 commented on August 20, 2024

Thank you again for your interest.

1. For the averaging case, our algorithm can still work, though it is not theoretically sound. In practice (as in the results shown in the experimental part) it can still estimate the average reward.

2. For the discounted case, you can slightly change the loss function without changing the whole framework: feed in a placeholder called 'isStart' indicating whether the transition pair is the initial transition of your trajectory, then redefine the \delta in the paper (x in the code) as

x = (1-self.isStart) * w * self.policy_ratio + self.isStart * norm_w - w_next

and sample each transition (s_i, a_i, s_i') with probability proportional to \gamma^i, where \gamma is the discount factor, when feeding in the data.

I may upload the discounted part when I have time, but you can definitely make this small revision yourself without much effort.
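A minimal NumPy sketch of those two changes (function and variable names here are hypothetical, not the repo's; the actual code feeds isStart through a TensorFlow placeholder):

```python
import numpy as np

def sample_discounted(transitions, gamma, n_samples, seed=0):
    """Sample transitions (s_i, a_i, s_i') with probability proportional to gamma**i."""
    rng = np.random.default_rng(seed)
    probs = gamma ** np.arange(len(transitions), dtype=float)
    probs /= probs.sum()
    idx = rng.choice(len(transitions), size=n_samples, p=probs)
    return [transitions[i] for i in idx]

def modified_x(w, policy_ratio, w_next, norm_w, is_start):
    """The redefined delta/x; is_start is 1 for a trajectory's first transition, else 0."""
    return (1 - is_start) * w * policy_ratio + is_start * norm_w - w_next
```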

clvoloshin commented on August 20, 2024

Great! Awesome.

What is a quick justification for x = (1-self.isStart) * w * self.policy_ratio + self.isStart * norm_w - w_next? Maybe it's obvious but I don't quite see it.

zt95 commented on August 20, 2024

In equation (15) of the paper, with probability (1-\gamma) we let \delta = 1 - w_next, and since we need to normalize, the 1 becomes norm_w instead. (I think there may be a typo in Algorithm 2, where it should read (1 - w_0) f_0.)
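Putting the two comments together, the mixture looks like this (a sketch only; \beta denotes the policy ratio and \bar{w} the normalizer norm_w, notation assumed rather than taken from the paper):

```latex
\delta =
\begin{cases}
  w(s)\,\beta(a \mid s) - w(s') & \text{with probability } \gamma \quad \text{(regular transition)}\\
  \bar{w} - w(s') & \text{with probability } 1-\gamma \quad \text{(initial transition; the 1 becomes } \bar{w}\text{)}
\end{cases}
```

which is exactly what the isStart switch in the x expression above implements.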

zt95 commented on August 20, 2024

and self.isStart is a 0/1 boolean flag.

clvoloshin commented on August 20, 2024

Ah, I see. Will the discounting still work even if gamma = 1? In other words, what if I care about the (non-discounted) sum of rewards over a finite horizon?

clvoloshin commented on August 20, 2024

With this new definition of \delta (or x), I'm still getting the average reward. The reason is, as I stated earlier, that if r is identically -1, then the output is -\sum(v) / \sum(v) = -1, the average reward (regardless of discounting). So I'd have to change this somehow.

Suppose the trajectory length is (on average) 100 for PI0, but (on average) 150 for PI1. Then what I'd like is OPE(PI1) ~= -150 (assuming -1 reward at every step). However, -\sum(v)/\sum(v) = -1. Hm....

Should the output be v^T r / N, where N is the number of trajectories?
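A toy illustration of the collapse being described (all numbers made up):

```python
import numpy as np

v = np.array([0.5, 1.2, 0.8, 1.5])  # made-up importance weights
r = -np.ones_like(v)                # reward identically -1 at every step
N = 2                               # made-up number of trajectories

normalized = (v @ r) / v.sum()      # -sum(v)/sum(v) = -1.0, the average reward
per_traj   = (v @ r) / N            # proposed v^T r / N = -2.0, a total-return scale
print(normalized, per_traj)
```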

zt95 commented on August 20, 2024

I'm not sure I understand your question correctly. So you are proposing an empirical estimator where we divide by N, the number of trajectories.

What we are doing is dividing by the sum of all the importance weights, which leads to a normalized style of estimator. In both practice and theory it leads to much lower variance.
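For reference, the two estimators in standard form (a sketch; w_i are the importance weights, r_i the rewards, n the number of samples): ordinary importance sampling divides by n, while the weighted (self-normalized) version divides by the sum of the weights, trading a small bias for much lower variance.

```latex
\hat{R}_{\mathrm{IS}} = \frac{1}{n}\sum_{i=1}^{n} w_i\, r_i,
\qquad
\hat{R}_{\mathrm{WIS}} = \frac{\sum_{i=1}^{n} w_i\, r_i}{\sum_{i=1}^{n} w_i}.
```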

zt95 commented on August 20, 2024

For your reference:

https://papers.nips.cc/paper/5249-weighted-importance-sampling-for-off-policy-learning-with-linear-function-approximation.pdf
