kaufmannlukas / ds-ultimate-tic-tac-toe
XOXO² - Use Reinforcement Learning to train an agent to play Ultimate Tic-Tac-Toe (UTTT).
Checklist of files that are updated / not yet updated:
Create an overview of what we have done, why, and in which order.
Example PPO:
Example MCTS:
Example model selection:
- We researched the following models and algorithms: ...
Game._Player()
@property
=> built-in decorator for a function => "decorates the function"; e.g. current_player
=> the decorated function can then be accessed like an attribute instead of being called as a function
`|` = or (element-wise "or" operator)
finished_games
=> the second part of that line of code builds a two-dimensional array out of the one-dimensional won/not-won array
Status quo:
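A minimal sketch of both notes. The Game/current_player names mirror the notes above; the contents of finished_games are made up for illustration:

```python
import numpy as np

class Game:
    def __init__(self):
        self._player = 1  # internal attribute

    @property
    def current_player(self):
        # @property lets callers write game.current_player
        # without parentheses, as if it were a plain attribute
        return self._player

game = Game()
print(game.current_player)  # accessed like a variable, not called

# second note: reshape the flat won/not-won array into a 3x3 grid
finished_games = np.array([1, 0, 0, 0, 1, 0, 0, 0, 1])
grid = finished_games.reshape(3, 3)
```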
We run and let MCTS "learn", but each new MCTS run starts again from zero.
Idea:
Reuse previous iterations: memorise the playouts, values, and nodes from earlier searches (up to a certain move depth).
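One way to sketch this idea (hypothetical Node interface — the real fields in mcts.py may differ): after each real move, keep the subtree under that move as the next search root instead of discarding the whole tree.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}   # move -> Node
        self.visits = 0
        self.value = 0.0

def advance_root(root, move):
    """Carry over the statistics gathered under `move`
    instead of starting the next search from zero."""
    child = root.children.get(move)
    if child is None:
        return Node()        # unexplored move: fall back to a fresh tree
    child.parent = None      # detach so the old tree can be garbage-collected
    return child

# usage: after each real move, advance the root into the played subtree
root = Node()
root.children[4] = Node(parent=root)
root.children[4].visits = 120
root = advance_root(root, 4)
```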
Approach:
Web Interface?
Outsource it to UX/UI team?
REPOS on how to create a UTTT interface (with React):
Questions:
How could we approach the INTERFACE?
Did she work with PPO?
XOXO²
Create/Edit README file in GitHub - explain project / whole process ( = guideline for presentation)
Work on documentation => see documentation task
Run 2 MCTS agents with different value systems:
Create new function to find out the max/deepest depth of a memory - similar to count_nodes and count_leaves
deepest path in mcts.py
Create table with all playouts - use EDA notebook
Research info in our Model sheet:
https://docs.google.com/spreadsheets/d/1d6w3fd5od51H21_R5_ACFNvpBmRK5gTZIEclcFiervY/edit#gid=1847487093
For more advanced AI algorithms, it's often better to keep the game environment and the AI agent separate. This separation allows for more flexibility and easier integration of different agents.
Here's a recommended structure for the implementation, including:
Game Environment (UTTT):
Create a UTTT class that encapsulates the game rules, state representation, and methods for interacting with the game. This class should include functions to:
class UTTT:
    def __init__(self):
        # Initialize game state, rules, and attributes
        ...

    def get_current_state(self):
        # Return the current game state
        ...

    def get_valid_actions(self):
        # Return a list of valid actions in the current state
        ...

    def perform_action(self, action):
        # Update the game state based on the selected action
        ...

    def is_game_over(self):
        # Check if the game is over and determine the outcome
        ...
Agent:
Create an agent class that can interact with the game environment to make decisions. You can implement different agents, including a random baseline agent, an MCTS-based agent, or more advanced AI models.
class RandomAgent:
    def __init__(self, num_actions):
        # Initialize the agent with relevant parameters
        ...

    def select_action(self, state):
        # Implement action selection logic (e.g., random choice)
        ...
Policy:
If you plan to implement more advanced algorithms, such as MCTS or deep reinforcement learning, you might use a policy class to represent the agent's strategy.
class Policy:
    def __init__(self, model):
        # Initialize the policy with a model (e.g., neural network)
        ...

    def select_action(self, state):
        # Implement action selection logic based on the policy model
        ...
Interactions and Game Loop:
Finally, in your main script or game loop, you can set up the game environment, instantiate the agent, and manage interactions. The game loop should involve the following steps:
# Initialize the game environment
game = UTTT()

# Initialize the agent
agent = RandomAgent(game.get_num_actions())  # Or use a more advanced agent if desired

while not game.is_game_over():
    # Get the current state from the game
    state = game.get_current_state()

    # The agent selects an action based on the state
    action = agent.select_action(state)

    # Update the game state based on the action
    game.perform_action(action)

# Determine the outcome and handle it accordingly
outcome = game.get_game_outcome()
This happens after we have defined the base structure for the agent.
NN SETUP
Input representation:
How to represent the game state as input to the neural network?
=> use a CNN to capture spatial relationships within the board
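One common encoding for CNN input (a sketch — the plane layout and encode_state name are assumptions, not the project's actual code): stack the 9x9 board into feature planes for the current player's marks, the opponent's marks, and the empty cells.

```python
import numpy as np

def encode_state(board, current_player):
    """board: 9x9 array with 1 (X), -1 (O), 0 (empty).
    Returns a (3, 9, 9) tensor of binary feature planes."""
    board = np.asarray(board)
    return np.stack([
        (board == current_player).astype(np.float32),   # own marks
        (board == -current_player).astype(np.float32),  # opponent marks
        (board == 0).astype(np.float32),                # empty cells
    ])

board = np.zeros((9, 9), dtype=int)
board[0, 0] = 1    # X in the top-left cell
board[4, 4] = -1   # O in the centre cell
x = encode_state(board, current_player=1)
```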
Network architecture
Which architecture to choose for the NN?
e.g. deep neural networks, such as CNNs
=> experiment with different network architectures and layer sizes to find the best fit
=> our network should also be able to process the game state and produce Q-values for the different actions
Output layer
=> the output layer of the chosen NN should have as many nodes as there are possible actions in the game
=> each node in the output layer corresponds to a different action the agent can take
=> the network should produce Q-values. They represent the expected future rewards for taking each action in the current state
Activation functions
Which activation function to choose for the hidden layers?
e.g. ReLU or variants like Leaky ReLU
Loss function
Which loss function to choose?
e.g. MSE loss for Q-learning
Training Procedure
=> Train the NN (using data from self-play)
Exploration vs. Exploitation
Which exploration strategy to implement?
e.g. epsilon-greedy
=> Need to balance exploration (exploring new actions) vs. exploitation (choosing best-known actions)
=> crucial for a robust agent
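Epsilon-greedy can be sketched in a few lines (the q_values list here is indexed by action; the values are made up for illustration):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    # explore: with probability epsilon, pick a uniformly random action
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # exploit: otherwise pick the action with the highest known Q-value
    return max(range(len(q_values)), key=q_values.__getitem__)

action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0)  # pure exploitation
```

Epsilon is typically decayed over training, so the agent explores broadly early on and exploits its learned Q-values later.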
TUNING
Regularization & Hyperparameters
=> experiment with regularization techniques to prevent overfitting
=> tune hyperparameters
Iterate and Experiment
=> train, evaluate, and adjust the network architecture and parameters
Evaluate and Monitor
=> monitor the agent's performance (does it make progress?)
=> evaluate and test against different opponents or strategies
e.g. let generation n play against generation n-1 or n-2, etc.
QUESTIONS
do we want a self-play algorithm?
Here's a high-level overview of how you can implement MCTS for UTTT:
Define the Game Rules:
Start by defining the rules of UTTT. Understand the game mechanics, legal moves, and how the game state transitions from one position to another.
Create a UTTT Simulator:
Build a UTTT simulator that can represent the game state and allow you to make moves, check for wins, and determine valid moves.
MCTS Components:
Implement the core components of MCTS, which include the following:
UCT Algorithm:
Consider using the Upper Confidence Bound for Trees (UCT) algorithm, which is a widely used selection strategy within MCTS. UCT balances exploration and exploitation.
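The UCT selection rule can be sketched as follows (the function signature is illustrative; c is the exploration constant mentioned under "Tuning Parameters" below):

```python
import math

def uct_score(child_value, child_visits, parent_visits, c=math.sqrt(2)):
    # unvisited children score infinity, so they are always tried first
    if child_visits == 0:
        return float("inf")
    exploitation = child_value / child_visits  # average playout value
    # the bonus shrinks as a child is visited more often, relative
    # to how often its parent has been visited
    exploration = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploitation + exploration
```

During selection, the search descends from the root by repeatedly picking the child with the highest uct_score.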
Search and Decision-Making:
Create a search loop that repeatedly selects nodes to expand, simulate, and update based on MCTS until a stopping criterion (e.g., time limit or a maximum number of iterations) is reached.
Integration with UTTT Environment:
Integrate your MCTS implementation with the UTTT environment, allowing it to interact with the game and make decisions based on the search results.
Tuning Parameters:
Experiment with the parameters of the MCTS algorithm, such as exploration constant, to fine-tune its performance.
Parallelization (Optional):
If computational resources allow, consider parallelizing MCTS to speed up the search process.
MCTS Variants:
Explore advanced MCTS variants and enhancements beyond plain UCT (e.g. RAVE or parallelised tree search) that may improve performance.
Testing and Evaluation:
Test your MCTS-based UTTT AI against various opponents or strategies to evaluate its performance and iteratively refine your implementation.
Optimization (Optional):
If your MCTS implementation is running too slowly, consider optimization techniques to make the search process more efficient.
Debugging and Profiling:
Use debugging tools and profiling to identify and address issues in your implementation.
LINKS
Good GitHub Repos for coding MCTS:
MCTS install via pip:
Not too helpful Repos: