
puyuan1996 commented on June 1, 2024

To effectively model this game environment and train an agent to maximize the value of gold nuggets obtained from a 10x10 grid, you can take the following steps:

1. Environment Modeling

First, you need to build a simulation environment that reflects the change in the grid state after each of the agent's actions. Based on your description, this environment should conform to the Markov Decision Process (MDP) model commonly used in reinforcement learning. Specifically, you can modify the base environment; you can refer to the guide on customizing environments. A minimal code sketch of such an environment is given after the lists below.

State Representation:

  • Grid State: A 10x10 matrix, where each element represents the gold nugget value at that position.
  • Visited Marking: Optionally, a matrix of the same size, marking the positions that have been mined.

Actions:

  • The agent can choose to mine any grid position that has not yet been mined. Each action can be represented as a coordinate (i, j) in the grid.

Rewards:

  • When the agent chooses to mine at position (i, j), the reward is the gold nugget value v(i, j) at that position. Subsequently, all values in that row and column will be set to 0.

Transitions:

  • After choosing a position, update the grid to reflect the state of the destroyed gold nuggets in that row and column.
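
As a reference point, here is a minimal, framework-agnostic sketch of such an environment in plain Python/NumPy. The class and method names are illustrative only and are not part of LightZero; a real environment should follow the base environment interface described in the customization guide mentioned above.

```python
import numpy as np

class GoldGridEnv:
    """Illustrative sketch of the mining game (not the LightZero BaseEnv API)."""

    def __init__(self, size=10, max_steps=None, seed=None):
        self.size = size
        # Optional fixed step budget; by default the episode ends when no
        # positive-valued cell is left.
        self.max_steps = max_steps
        self.rng = np.random.default_rng(seed)
        self.grid = None
        self.steps = 0

    def reset(self):
        # Gold values drawn from (0, 1); any other distribution could be substituted.
        self.grid = self.rng.random((self.size, self.size)).astype(np.float32)
        self.steps = 0
        return self.grid.copy()

    def legal_actions(self):
        # Flat indices of cells that still hold a positive gold value.
        return np.flatnonzero(self.grid.reshape(-1) > 0).tolist()

    def step(self, action):
        # Decode the flat action index into a (row, column) coordinate.
        i, j = divmod(int(action), self.size)
        reward = float(self.grid[i, j])
        # Mining (i, j) destroys every nugget in row i and column j.
        self.grid[i, :] = 0.0
        self.grid[:, j] = 0.0
        self.steps += 1
        done = not self.legal_actions() or (
            self.max_steps is not None and self.steps >= self.max_steps
        )
        return self.grid.copy(), reward, done, {}
```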

2. Observation Modeling

For observations, usually providing the current complete grid state is sufficient, as it contains all the information needed for the next decision. This is similar to many classic board games like Go or Chess, where the agent needs to consider the current global state to make decisions.

  • Single Frame: Provide only the current grid state at each step.
  • Historical Information: In scenarios where it is necessary to evaluate the impact of previous decisions on the current state, you might provide the states of the last few steps as additional information to the agent. This can be achieved by stacking the grid states of several recent time steps, as in the sketch after this list. (However, this is unnecessary in a setting that already satisfies the MDP property.)
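
If you do decide to feed historical frames, the stacking itself is straightforward. The following is a small illustrative helper, independent of LightZero's own observation handling:

```python
import numpy as np
from collections import deque

class FrameStack:
    """Keeps the last `n_frames` grid states and stacks them into one observation."""

    def __init__(self, n_frames=3, grid_shape=(10, 10)):
        self.n_frames = n_frames
        self.grid_shape = grid_shape
        self.frames = deque(maxlen=n_frames)

    def reset(self, first_grid):
        # Pad the history with copies of the initial grid.
        for _ in range(self.n_frames):
            self.frames.append(np.asarray(first_grid, dtype=np.float32).copy())
        return self.observation()

    def push(self, grid):
        self.frames.append(np.asarray(grid, dtype=np.float32).copy())
        return self.observation()

    def observation(self):
        # Shape (n_frames, H, W); the last entry is the current grid.
        return np.stack(self.frames, axis=0)
```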

3. Training Methods

For training the agent, you could start with MuZero. This configuration guide shows how to run the algorithm on a custom environment.
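
For orientation only, a config skeleton in the style of the repository's example files might look roughly like the following. The exact keys and their locations differ between LightZero versions, so treat every name here as an assumption to be checked against the configuration guide linked above.

```python
from easydict import EasyDict

# Hypothetical values for a 10x10 gold grid; all key names should be verified
# against an actual LightZero example config before use.
gold_grid_muzero_config = EasyDict(dict(
    exp_name='gold_grid_muzero_seed0',
    env=dict(
        collector_env_num=8,
        evaluator_env_num=3,
        n_evaluator_episode=3,
    ),
    policy=dict(
        model=dict(
            observation_shape=(1, 10, 10),  # single-frame grid observation
            action_space_size=100,          # one action per grid cell
        ),
        num_unroll_steps=5,                 # episodes are only a few steps long
        td_steps=5,
        num_simulations=50,                 # MCTS simulations per move
        batch_size=256,
        cuda=True,
    ),
))
```

Training is then typically launched through an entry function such as train_muzero, as described in the configuration guide linked above.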

4. Performance Evaluation

It's important to continually evaluate the agent's performance during development. This can be done by calculating the average score of the agent across multiple independent test environments. Also, monitor any potential issues in the agent's decision-making process, such as an excessive focus on short-term benefits at the expense of long-term strategy. You can refer to this log documentation.
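
A simple way to obtain that average score, independent of the framework's built-in evaluator, is a loop like the following. The environment and policy callables refer to the hypothetical ones sketched earlier in this comment:

```python
import numpy as np

def evaluate(env, select_action, n_episodes=100):
    """Average episode return of `select_action` over independent episodes.

    `select_action(obs, legal_actions)` must return a flat action index;
    this helper is illustrative and not part of LightZero.
    """
    returns = []
    for _ in range(n_episodes):
        obs = env.reset()
        done, total = False, 0.0
        while not done:
            action = select_action(obs, env.legal_actions())
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return float(np.mean(returns))
```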

Summary

It's recommended to start with a simple model, using only the current grid state as the observation, and then gradually increase complexity, such as introducing historical states or improving learning algorithms, to optimize the agent's performance. Through continuous iteration and testing, you can find the most suitable approach for this specific problem.


valkryhx commented on June 1, 2024

Thank you for your detailed answer.
I have implemented the code above, but I found that the agent's performance is weak.
I use a heuristic algorithm as the baseline. This policy is naive and greedy: it always chooses the position with the maximum value on the current grid. It does not take the long-term effect of the current choice into account, and the greedy action can indeed hurt the potential to reach the maximum total score.
Even so, this naive policy performs better than the agent trained with the MuZero algorithm.
When I initialize the 10x10 grid with random values in (0, 1), the greedy agent reaches a total score of around 4+, but the MuZero/EfficientZero agent only reaches 2 to 3, or even less. Each agent finishes the game in exactly 5 steps, after which the grid is entirely filled with 0 and the episode is done.
The observation I use is the last 3 frames of the grid (including the current state). The number of training steps is about 4*10^4.
I don't know how to improve the performance of the agent.
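
For reference, the greedy baseline described here can be expressed in a few lines against the environment and evaluation helpers sketched earlier in the thread; the function name and interface are illustrative.

```python
import numpy as np

def greedy_action(obs, legal_actions):
    """Always mine the cell with the highest current value, ignoring the
    long-term effect on the rest of the grid."""
    flat = np.asarray(obs).reshape(-1)
    legal = np.asarray(legal_actions)
    return int(legal[np.argmax(flat[legal])])

# Example: average score of the greedy baseline over 100 random 10x10 grids,
# using the hypothetical GoldGridEnv and evaluate() sketched above.
# print(evaluate(GoldGridEnv(size=10, max_steps=5, seed=0), greedy_action))
```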


puyuan1996 commented on June 1, 2024

Hello! According to your description, first please confirm: is the episode length fixed at 5 steps? If MuZero is performing poorly and there is no issue with the environment part you have written, one possible reason could be that the configuration settings are not suitable. Since the episodes in your environment are very short, many hyperparameters may need corresponding adjustments. Could you please provide your complete configuration file and training loss records? You might want to refer to and modify the configuration file in this link.


valkryhx commented on June 1, 2024

This is the config. When the grid size is fixed at 10, I set the number of action steps to always be 5, because 5 properly chosen (non-conflicting, not yet mined) positions are enough to complete the game.


puyuan1996 commented on June 1, 2024

Hello, I recommend starting by adjusting MuZero rather than EfficientZero, as the latter adds complexity, particularly in predicting the value prefix, which may not be advantageous in your environment. To better diagnose the issue, it would be helpful if you could provide specific log records on TensorBoard. Currently, your action space is set at 100, which could lead to insufficient exploration and potential convergence to local optima. Furthermore, the length of each episode is very short, only 5, with default settings of num_unroll_steps=5 and td_steps=5. You might need to debug to ensure that these data boundaries are being handled appropriately. Our code might not have been adequately tested with such short episodes previously. Thank you.


valkryhx commented on June 1, 2024

Thank you for your reply!
I have some questions that I don't know how to resolve:
1. The grid size is 10, so 100 positions can be chosen, and I simply use an action space of size 100. But, as you said, this makes it harder for the agent to learn the proper action for a given observation. How can I reduce the size of the action space? Unlike the basic left/right/up/down 4-move agent in a grid, my environment is more like a Go board on which a stone is placed.
2. The length of each episode is very short, only 5, with the default settings of num_unroll_steps=5 and td_steps=5. That is just the case for the 10x10 grid; when the grid size becomes bigger, the value of 5 can be bigger. Now that I have set it to 5, is the value invalid?


puyuan1996 commented on June 1, 2024

Hello, the action space is fixed at 100, which is considerably smaller compared to the maximum action space of 19*19 in Go. Therefore, theoretically, MuZero should manage this scale effectively. If the learning performance is currently suboptimal, it could be attributed to the action space. This hypothesis can be verified by monitoring metrics such as loss and policy_entropy in TensorBoard. If MuZero is indeed encountering a local optimum, consider increasing the temperature parameter or employing the epsilon-greedy strategy for adjustments. As for the boundary condition when the episode length is set to 5, LightZero should handle it proficiently, but you still need to perform debugging and verification locally to confirm.
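
As an illustration of those two knobs, the repository's example configs expose fields along the following lines. The exact key names vary between LightZero versions, so treat every name below purely as an assumption to verify against your local copy before use.

```python
# Assumed key names -- verify against your LightZero version's example configs.
exploration_overrides = dict(
    policy=dict(
        # Visit-count temperature: a higher fixed value keeps the MCTS policy
        # softer for longer, which encourages exploration.
        manual_temperature_decay=False,
        fixed_temperature_value=1.0,
        # Epsilon-greedy exploration during data collection.
        eps=dict(
            eps_greedy_exploration_in_collect=True,
            type='linear',
            start=1.0,
            end=0.05,
            decay=int(1e4),
        ),
    ),
)
```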


valkryhx commented on June 1, 2024

Thanks for your help! Where can I set the temperature?

