There might be a problem in the solution. One way to see it is the plots for both 'no


MC predict's solution about reinforcement-learning HOT 7 CLOSED

dennybritz commented on May 13, 2024

MC predict's solution

from reinforcement-learning.

Comments (7)

fferreres commented on May 13, 2024

Policy(state) returns an array of probabilities, which is not an action but a distribution over possible actions. The player can't take any action past 21 without a usable ace because the game is over. The agent can't make choices, nor can it change any value, get new cards, etc.
The value of reaching 22 or more is already factored into the value of the earlier action of asking for another card: it is averaged into the expected Q of that action. There is nothing to report beyond 21 because those states are terminal, so nothing can be improved there; only the action that led to > 21 can be optimized.
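To make the point about terminal states concrete, here is a minimal sketch of first-visit Monte Carlo prediction. It is not the repository's solution: the function name `first_visit_mc`, the episode format (a list of `(state, reward)` pairs) and the two hand-written episodes are assumptions for illustration only. The bust reward of -1 is credited to the state the player hit from, so no state above 21 ever shows up in the value table.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Average first-visit returns per state.

    episodes: list of episodes, each a list of (state, reward) pairs,
    where reward is what the agent received after acting in that state.
    (Hypothetical format, chosen for this sketch.)
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        first_return = {}
        # Walk backwards; the last overwrite for a state is its first visit.
        for state, reward in reversed(episode):
            G = gamma * G + reward
            first_return[state] = G
        for state, g in first_return.items():
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# States are (player_sum, dealer_card, usable_ace).
episodes = [
    # Hit on 16 and bust: the -1 is attached to the state (16, 10, False).
    [((16, 10, False), -1.0)],
    # Hit on 16, reach 19, then stick and win.
    [((16, 10, False), 0.0), ((19, 10, False), 1.0)],
]
V = first_visit_mc(episodes)
```

With these two episodes, the state `(16, 10, False)` averages the bust (-1) and the later win (+1), while the busted hand itself never becomes a key in `V`.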


dennybritz commented on May 13, 2024

Basically what fferreres said. It's definitely possible that there's a problem with the solution, but right now I don't see it and it looks right to me. Closing this for now (feel free to re-open and elaborate if you still think it's wrong).


seanxwh commented on May 13, 2024

Thanks @fferreres and @dennybritz for taking the time to look and explain. However, I'm still wondering why the no-usable-ace cases don't show anything about the player busting and getting a negative reward. Is it because the graphs are clipped?


dennybritz commented on May 13, 2024

The plot shows the value function for all states, but there is no state where the player is beyond 21 points because the game will have ended by then.


seanxwh commented on May 13, 2024

I see, thanks for helping :)


fferreres commented on May 13, 2024

I think I also remember that, in this implementation of the environment, when a player has a usable ace and goes over 21, the code itself reduces the value of the player's hand by 10 (e.g. 23 becomes 13) and sets the usable-ace state variable to False. So the usable-ace and no-usable-ace results are each about the best next action conditioned on whether you still have a usable ace. This is also why a player with a usable ace CAN draw past 21 but will never end up in a state over 21.
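A rough sketch of that hand-value logic, written from memory in the style of the Gym Blackjack environment rather than copied from it. The helper names `usable_ace` and `sum_hand` and the card encoding (ace stored as 1) are assumptions here.

```python
def usable_ace(hand):
    """True if the hand holds an ace that can count as 11 without busting."""
    return 1 in hand and sum(hand) + 10 <= 21

def sum_hand(hand):
    """Hand total, counting one ace as 11 only while that keeps it at 21 or under."""
    if usable_ace(hand):
        return sum(hand) + 10
    return sum(hand)

before = [1, 6]      # ace + 6: counted as 17 with a usable ace
after = [1, 6, 6]    # drawing a 6 would give 23, so the ace drops to 1
```

Here `before` is a usable-ace 17, and after the hit the same hand is a no-usable-ace 13: the would-be 23 never appears as a state.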


seanxwh commented on May 13, 2024

@fferreres yes, that's right. That is why I saw the negative drop in my simulation only for the no-usable-ace cases.

