
XOXO²

[image: XOXO² logo]

Introduction

The project was created during the final project phase of the Data Science Bootcamp at the Spiced Academy in Berlin in November 2023.
The project goal was to use Reinforcement Learning to teach an agent how to play Ultimate Tic-Tac-Toe (U_T-T-T).
In this group project, we first created a Monte Carlo Tree Search (MCTS) algorithm from scratch. Then we used an Artificial Neural Network (ANN) to implement Proximal Policy Optimization (PPO) to improve the performance of our agent.
Additionally, we implemented an interactive interface with Flask and HTML as a final product, where the user can play Ultimate Tic-Tac-Toe against our engine.

Project members:
Lukas Kaufmann, Paul Kotyrba, Nils Schillmann, Sreejith Thamban.
Thanks to Jannes Brunner.

Rules of Ultimate Tic-Tac-Toe

[image: rules of Ultimate Tic-Tac-Toe]

Ultimate Tic-Tac-Toe (U_T-T-T) is played on nine tic-tac-toe boards arranged in a 3 × 3 grid.
Playing on a field inside a board (local game) determines the board in which the opponent must play their next move.
The goal is to win three boards (local games) in a row.
If you are directed to a board that is full or has already been won (and is therefore closed), you may play your next move on any open board.
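
To make this rule concrete, here is a minimal Python sketch of the "next board" logic; the names (allowed_boards, closed_boards) are hypothetical and not taken from our code:

# Boards and fields are both indexed 0..8 in reading order.
def allowed_boards(last_field, closed_boards):
    """Return the set of boards the next player may play in.

    last_field:    index (0-8) of the field the previous move was played on,
                   or None if it is the first move of the game.
    closed_boards: set of board indices that are full or already won.
    """
    open_boards = {b for b in range(9) if b not in closed_boards}
    if last_field is None or last_field in closed_boards:
        # First move, or the targeted board is closed: any open board is allowed.
        return open_boards
    return {last_field}

# Example: the previous move was played on field 4, but board 4 is already closed,
# so the next player may choose any open board.
print(allowed_boards(4, closed_boards={4, 7}))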

[image: draw example]

Interface Example

[animation: interface demo]

About Reinforcement Learning

Reinforcement Learning is used not only in gaming environments, but also in robotics, self-driving cars, and the development of generative AI. We were interested in exploring this topic since it was not part of our bootcamp curriculum and has become increasingly important in recent years.

Model types

[diagram: MCTS and PPO model overview]

As mentioned before, we first created a Monte Carlo Tree Search (MCTS) algorithm from scratch and additionally implemented a memory file of former iterations that the model can load and therefore learn from.
As a second model, we created a Neural Network structure to perform Proximal Policy Optimization (PPO).

MCTS

[diagram: MCTS]

The MCTS works as follows: one iteration consists of selecting a move, expanding the game tree (all nodes in the tree are possible game states or moves that can be played), and simulating the outcome from there by playing random moves. The outcome of this playout is then backpropagated to the top along the path that has been played, and the visit counts and reward values of these nodes are updated. In order to access these results in future runs, we implemented a memory function. This memory file grows large in size quickly, though.
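
To illustrate these four phases, here is a simplified sketch of one MCTS iteration; the Node layout and the game interface (copy, legal_moves, play, winner) are illustrative assumptions, not our actual mcts.py:

import math
import random

class Node:
    def __init__(self, parent=None, move=None):
        self.parent = parent
        self.move = move        # move that led to this node
        self.children = []      # expanded child nodes
        self.visits = 0         # visit count
        self.reward = 0.0       # accumulated reward

    def ucb(self, c=1.41):
        # Unvisited children are explored first; otherwise balance average reward
        # (exploitation) against how rarely the child has been visited (exploration).
        if self.visits == 0:
            return float("inf")
        return self.reward / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def run_iteration(root, game):
    state = game.copy()
    node = root
    # 1. Selection: walk down the tree, always picking the child with the best UCB score.
    while node.children:
        node = max(node.children, key=lambda ch: ch.ucb())
        state.play(node.move)
    # 2. Expansion: add one child per legal move of the reached state.
    for move in state.legal_moves():
        node.children.append(Node(parent=node, move=move))
    # 3. Simulation (rollout): play random moves until the game is over.
    while state.legal_moves():
        state.play(random.choice(state.legal_moves()))
    result = state.winner()     # e.g. 1 (win), 0.5 (draw), 0 (loss)
    # 4. Backpropagation: update visit counts and rewards along the played path.
    # (In a two-player game the result is usually flipped between tree levels;
    # this is omitted here for brevity.)
    while node is not None:
        node.visits += 1
        node.reward += result
        node = node.parent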

PPO

[diagram: PPO]

In the PPO model, our agent performs an action (= move) in the environment (= game) and receives a new game state as well as rewards for its actions. The actor network predicts probabilities for each move, and the critic network estimates the value of the given board state. Compared to the MCTS memory file, the file sizes of the two networks are very small, which makes them much easier to load into the frontend / playable gaming interface.
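
To make the actor/critic split concrete, here is a minimal PyTorch sketch. The layer sizes, the flattened 81-field input, and the class names are placeholders, not our actual network.py:

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim=81, n_actions=81, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        # Move probabilities; illegal moves would be masked out before sampling.
        return torch.softmax(self.net(state), dim=-1)

class Critic(nn.Module):
    def __init__(self, state_dim=81, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        # Estimated value of the given board state.
        return self.net(state)

# Example: one flattened 9x9 board as input.
state = torch.zeros(1, 81)
probs = Actor()(state)            # shape (1, 81): one probability per move
value = Critic()(state)           # shape (1, 1): state value estimate
dist = torch.distributions.Categorical(probs)
action = dist.sample()            # sampled move index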

Model evaluation

[diagram: evaluation round 1]

In order to test the performance of our agents, we let them play U_T-T-T against our baseline model, which performs random moves, as well as against each other.
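
The evaluation itself boils down to a simple playout loop, roughly like the following sketch (the game and agent interfaces shown here are illustrative assumptions, not our actual playout.py):

# Two agents play n games against each other; we count wins, draws, and losses
# from the perspective of agent_a.
def evaluate(agent_a, agent_b, new_game, n_games=100):
    wins = draws = losses = 0
    for _ in range(n_games):
        game = new_game()
        players = {1: agent_a, 2: agent_b}
        while not game.is_game_over():
            agent = players[game.current_player]
            game.perform_action(agent.select_action(game.get_current_state()))
        winner = game.get_winner()   # 1, 2, or None for a draw
        if winner == 1:
            wins += 1
        elif winner is None:
            draws += 1
        else:
            losses += 1
    return wins, draws, losses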

[chart: round 1 results]

Finally, we put our two strongest agents - MCTS with memory and PPO+MCTS - in the ring to play against each other.

[diagram: evaluation round 2]

The final results show that our PPO+MCTS agent has the strongest performance, winning about 60% of the games against MCTS with memory.

[chart: round 2 results]

It is therefore our current strongest agent and the one connected to the interface.

Files and Folders

The following tree-like folder structure diagram provides an overview of our repository:

.
├── data/
│   ├── eda/
│   │   └── ... ->  files for 'model_analysis.ipynb'
│   ├── local/
│   │   └── ... ->  local files, will not be uploaded to repository
│   └── models/
│       ├── mcts/
│       │   └── mcts_memory_small.pkl
│       └── ppo/
│           ├── ppo_v1_actor.pth
│           └── ppo_v1_critic.pth
├── frontend/
│   └── ds-uttt/
│       └── ... ->  interface setup
├── images/
│   └── ... ->  files for 'README.md'
├── logs
├── src/
│   ├── agents/
│   │   ├── network/
│   │   │   └── network.py
│   │   ├── agent.py
│   │   ├── human.py
│   │   ├── mcts.py
│   │   ├── ppo.py
│   │   └── random.py
│   ├── environments/
│   │   ├── game
│   │   └── uttt_env
│   ├── old_files/
│   │   └── ... ->  outdated files for comprehension purposes
│   ├── tests/
│   │   ├── test_game.py
│   │   └── test_mcts.py
│   ├── main_flask.py
│   ├── playout.py
│   ├── test_ppo.py
│   ├── train_mcts.py
│   └── train_ppo.py
├── environment.yml
├── model_analysis.ipynb
└── README.md

Setup Environment

To set up with anaconda / mamba:

conda env create --file environment.yml

To update the environment:

conda env update --file environment.yml

To activate the environment:

conda activate ds-uttt

Setup Interface

TO RUN GAME (backend):

  1. Use the command: "flask --debug --app main_flask run" (Note: make sure you are in the correct folder.)

TO RUN GAME (frontend):

  2. Open a new terminal window and go to ../frontend/ds-uttt.

  3. Use the command: "npm run dev".

  4. Click or copy the URL: "http://localhost:5173/".

  5. The game should now be running in your browser. (Note: if you experience graphical issues, try another browser.)

Contributors

jannesbrunner, kaufmannlukas, paulzbigniew, thamban15

ds-ultimate-tic-tac-toe's Issues

Ideas if anyone has time and nothing to do

Create/Edit README file in Github - explain project / whole process ( = guideline for presentation)


Work on documentation => see documentation task


Run 2 MCTS agents with different value systems (see the sketch after this list):

  • initially: -1 (loss), 0 (draw), 1 (win)
  • improved: 0 (loss), 0.5 (draw), 1 (win)
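
As a reference, the two value systems written out as simple outcome-to-reward mappings (a sketch, not the exact code in mcts.py):

initial_values  = {"loss": -1.0, "draw": 0.0, "win": 1.0}   # original scheme
improved_values = {"loss":  0.0, "draw": 0.5, "win": 1.0}   # readable as winning probabilities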

Create a new function to find the maximum/deepest depth of a memory (deepest path) in mcts.py, similar to count_nodes and count_leaves (see the sketch below)

  • additional functions
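
A recursive sketch of such a max-depth helper, assuming each memory node exposes a list of children (an illustration, not the existing count_nodes/count_leaves code):

def max_depth(node):
    # Depth of a leaf is 1; otherwise 1 plus the deepest subtree below it.
    if not node.children:
        return 1
    return 1 + max(max_depth(child) for child in node.children)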

Create table with all playouts - use EDA notebook

Designing the Neural Network for the RL Agent

This happens after we have defined the base structure for the agent.


NN SETUP

Input representation:
How to represent the game state as input to the neural network?
=> use a CNN to capture spatial relationships within the board

Network architecture
Which architecture to choose for the NN?
i.e. deep neural networks, such as CNN
=> experiment with different network architectures and layer sizes to find the best one
=> our network should also be able to process the game state and produce Q-values for the different actions

Output layer
=> the output layer of the chosen NN should have as many nodes as there are possible actions in the game
=> each node in the output layer corresponds to a different action the agent can take
=> the network should produce Q-values. They represent the expected future rewards for taking each action in the current state
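
A minimal PyTorch sketch of such a Q-network. The board encoding (two 9 x 9 channels for own and opponent stones), the channel counts, and the layer sizes are assumptions for illustration only:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions=81):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, n_actions),   # one Q-value per possible action
        )

    def forward(self, board):
        # board: (batch, 2, 9, 9) -> Q-values: (batch, 81)
        return self.head(self.conv(board))

q_values = QNetwork()(torch.zeros(1, 2, 9, 9))   # shape (1, 81)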

Activation functions
Which activation function to choose for hidden layers?
i.e. ReLU or variants like Leaky ReLU

Loss function
Which loss function to choose?
i.e. MSE loss for Q-learning

Training Procedure
=> Train the NN (using data from self-play)

Exploration vs. Exploitation
Which exploration strategy to implement?
i.e. epsilon-greedy
=> Need to balance exploration (exploring new actions) vs. exploitation (choosing best-known actions)
=> crucial for a robust agent
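
A minimal epsilon-greedy sketch (the Q-values and valid actions here are dummy placeholders):

import random

def epsilon_greedy(q_values, valid_actions, epsilon=0.1):
    # With probability epsilon explore a random valid action,
    # otherwise exploit the action with the highest Q-value.
    if random.random() < epsilon:
        return random.choice(valid_actions)                # explore
    return max(valid_actions, key=lambda a: q_values[a])   # exploit

# Example with dummy Q-values for three valid actions.
print(epsilon_greedy({0: 0.2, 4: 0.7, 8: 0.1}, valid_actions=[0, 4, 8]))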


TUNING

Regularization & Hyperparameters
=> experiment with regularization techniques to prevent overfitting
=> tune hyperparameters

Iterate and Experiment
=> train, evaluate, and adjust the network architecture and parameters

Evaluate and Monitor
=> monitor the agent's performance (does it make progress?)
=> evaluate and test against different opponents or strategies
i.e. let generation n play against generation n-1 or n-2 etc.

Documentation & Clean-Up of Project

Check list of files that are updated/not updated yet:

  • src/
    • agents/
      • agent.py
      • human.py
      • random.py
      • mcts.py
      • ppo.py
      • network/
        • network.py
    • environments/
      • game
      • uttt_env
    • tests/
      • test_game.py
      • test_mcts.py
    • main_flask.py
    • mcts_train.py
    • ppo_train.py
    • main_play.py

Create an overview of what we have done and why and in which order.

Example PPO:

  1. Research done 15th and 16th Nov
  2. Started code implementation 16th Nov
  3. Issues getting PPO running with the example we're using (continuous action space vs. discrete action space => categorical distribution, not multivariate) (16th and 17th)

Example MCTS:

  1. We created the basic UTTT game
  2. We built the MCTS structure
  3. We built the MCTS class and Node class, the two main ingredients of the MCTS code
  4. We let MCTS play against a random agent
  5. Given that MCTS itself is no Reinforcement Learning algorithm (and learning about RL and building RL algorithms is one of our main goals for this project), we explored options to update MCTS into a "kind-of" RL model.
  6. We updated MCTS with a memory function: We are now able to give MCTS_agent_1 a memory from previous MCTS runs (i.e. with 1000 iterations in 100 plays)
  7. MCTS_agent_1 (with memory) played against random_agent (100 wins, 0 draws, 0 losses)
  8. We let MCTS_agent_1 play against MCTS_no_memory (50 wins, 11 draws, 39 losses)
  9. Updated values for wins and losses - before it was -1 (loss), 0 (draw), 1 (win); now it is 0 (loss), 0.5 (draw), 1 (win). This is best practice: it allows us to read the values as winning probabilities and it rewards draws more effectively.
  10. MCTS_agent_2 (with memory, with updated values) played against MCTS_no_mem (40 wins, 26 draws, 34 losses)
  11. MCTS_agent_2 (with once updated memory, with updated values) played against MCTS_no_mem (48 wins, 24 draws, 28 losses)
  12. MCTS_agent_2 (with twice updated memory, with updated values) played against MCTS_no_mem (51 wins, 23 draws, 26 losses). On top of that, MCTS_agent_2 finally plays (5,5) as its starting move - and the first time we played against it ourselves, we were forced to save ourselves with a draw. Done with MCTS on the 15th.

Example model selection: We researched the following models and algorithms: ...

  • We selected the following models: MCTS, PPO, and as a possible 3rd option (if time): a combination of MCTS with AlphaZero
  • Why did we choose these models? MCTS: ... | PPO: ... | possibly MCTS + AlphaZero
  • What are the differences of our models? Sheet

Meeting with her (16th)

Questions:

How could we approach the INTERFACE?

  • frontend advice: what tools to use?
  • Does she know Render? If so, does she recommend it?

Did she work with PPO?

  • What should our game environment look like?

Code Monte Carlo Tree Search & implement it

Here's a high-level overview of how you can implement MCTS for UTTT:


Define the Game Rules:

Start by defining the rules of UTTT. Understand the game mechanics, legal moves, and how the game state transitions from one position to another.

Create a UTTT Simulator:

Build a UTTT simulator that can represent the game state and allow you to make moves, check for wins, and determine valid moves.


MCTS Components:

Implement the core components of MCTS, which include the following:

  1. Node Structure: Create a node structure that represents each state in the search tree. Each node should store information such as the number of visits, the total reward, and the possible actions.
  2. Selection: Develop a selection strategy (e.g., Upper Confidence Bound) to choose nodes in the tree to explore further.
  3. Expansion: When a selected node has unexplored actions, expand the tree by creating child nodes for those actions.
  4. Simulation (Rollout): Simulate random playouts from a node to estimate the value of unexplored states. The rollout policy can be random or based on heuristics.
  5. Backpropagation: Update the statistics of nodes as you backpropagate the results of rollouts to their parent nodes.

UCT Algorithm:

Consider using the Upper Confidence Bound for Trees (UCT) algorithm, which is a widely used selection strategy within MCTS. UCT balances exploration and exploitation.
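
The UCT score for a child node is usually the UCB1 formula; a short sketch (names are generic, not tied to a specific implementation):

import math

def ucb1(child_reward, child_visits, parent_visits, c=math.sqrt(2)):
    # First term favors exploitation (high average reward), second term favors
    # exploration (rarely visited children); c is the exploration constant.
    if child_visits == 0:
        return float("inf")   # unvisited children are tried first
    return child_reward / child_visits + c * math.sqrt(math.log(parent_visits) / child_visits)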

Search and Decision-Making:

Create a search loop that repeatedly selects nodes to expand, simulate, and update based on MCTS until a stopping criterion (e.g., time limit or a maximum number of iterations) is reached.


Integration with UTTT Environment:

Integrate your MCTS implementation with the UTTT environment, allowing it to interact with the game and make decisions based on the search results.


Tuning Parameters:

Experiment with the parameters of the MCTS algorithm, such as exploration constant, to fine-tune its performance.

Parallelization (Optional):

If computational resources allow, consider parallelizing MCTS to speed up the search process.


MCTS Variants:

Explore advanced MCTS variants like Monte Carlo Tree Search with Upper Confidence Bounds applied to Trees (MCT-UCT) or other enhancements that may improve performance.


Testing and Evaluation:

Test your MCTS-based UTTT AI against various opponents or strategies to evaluate its performance and iteratively refine your implementation.

Optimization (Optional):

If your MCTS implementation is running too slowly, consider optimization techniques to make the search process more efficient.

Debugging and Profiling:

Use debugging tools and profiling to identify and address issues in your implementation.


LINKS

Good GitHub Repos for coding MCTS:

MCTS install via pip:

Not too helpful Repos:

Define Goals of capstone

  • Learn about Reinforcement Learning
  • How do Rewards work? How does an Agent learn?
  • The final AI should be able to beat most human players
  • Having a playable (Web-) Interface with nice UI

QUESTIONS

do we want a self-play algorithm?

Understand Object-oriented programming

  • class objects
  • class inside another class => Game._Player()
  • @property => built-in decorator for a function => "decorates the function"; i.e. the current_player function can be treated as a variable, not as a function (see the example below)
  • | = or
  • finished_games => the second part of the code in that line creates a two-dimensional array out of the one-dimensional won/not-won array
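
A tiny example of the @property decorator, with hypothetical names, just to illustrate the note above:

class Game:
    def __init__(self):
        self._move_count = 0

    @property
    def current_player(self):
        # Derived from the move count instead of being stored separately.
        return 1 if self._move_count % 2 == 0 else 2

game = Game()
print(game.current_player)   # accessed like a variable, not called as game.current_player()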

Define Code Structure

For more advanced AI algorithms, it's often better to keep the game environment and the AI agent separate. This separation allows for more flexibility and easier integration of different agents.

Here's a recommended structure for the implementation, including:

  • game environment
  • agent
  • policy
  • interactions

Game Environment (UTTT):

Create a UTTT class that encapsulates the game rules, state representation, and methods for interacting with the game. This class should include functions to:

  • Initialize the game state.
  • Determine valid actions.
  • Update the game state based on selected actions.
  • Check for game termination and outcomes.
class UTTT:
    def __init__(self):
        # Initialize game state, rules, and attributes
        ...

    def get_current_state(self):
        # Return the current game state
        ...

    def get_valid_actions(self):
        # Return a list of valid actions in the current state
        ...

    def perform_action(self, action):
        # Update the game state based on the selected action
        ...

    def is_game_over(self):
        # Check if the game is over and determine the outcome
        ...


Agent:

Create an agent class that can interact with the game environment to make decisions. You can implement different agents, including the random baseline agent, MCTS-based agent, or more advanced AI models.

class RandomAgent:
    def __init__(self, num_actions):
        # Initialize the agent with relevant parameters
        ...

    def select_action(self, state):
        # Implement action selection logic (e.g., random choice)
        ...


Policy:

If you plan to implement more advanced algorithms, such as MCTS or deep reinforcement learning, you might use a policy class to represent the agent's strategy.

class Policy:
    def __init__(self, model):
        # Initialize the policy with a model (e.g., neural network)
        ...

    def select_action(self, state):
        # Implement action selection logic based on the policy model
        ...


Interactions and Game Loop:

Finally, in your main script or game loop, you can set up the game environment, instantiate the agent, and manage interactions. The game loop should involve the following steps:

# Initialize the game environment
game = UTTT()

# Initialize the agent
agent = RandomAgent(game.get_num_actions())  # Or use a more advanced agent if desired

while not game.is_game_over():
    # Get the current state from the game
    state = game.get_current_state()

    # The agent selects an action based on the state
    action = agent.select_action(state)

    # Update the game state based on the action
    game.perform_action(action)

# Determine the outcome and handle it accordingly
outcome = game.get_game_outcome()


Implement "Memory" for MCTS

Status quo:
We run MCTS and let it "learn", but for the next run of MCTS we start again from zero.

Idea:
Use the previous iterations and memorize (down to a certain depth of moves) the previous playouts, values, and nodes.


Approach:

  1. Implement a "single-time" memory of 1 previous iteration => for now, only 1 memory is given and the MCTS run uses and updates it for one additional iteration => the values of 1 run are stored for 1 additional run
  2. Implement a loop with n iterations of MCTS runs, so that we update the "memory" n times with the values of the previous runs => all previous run values stored for n additional runs (see the persistence sketch below)
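
A minimal sketch of persisting such a memory with pickle. The file path and the stored tree object are placeholders; the actual memory format in the repository may differ:

import os
import pickle

MEMORY_FILE = "data/models/mcts/mcts_memory.pkl"   # placeholder path

def load_memory():
    if os.path.exists(MEMORY_FILE):
        with open(MEMORY_FILE, "rb") as f:
            return pickle.load(f)   # previously stored root node / tree
    return None                     # no memory yet: start from scratch

def save_memory(root):
    with open(MEMORY_FILE, "wb") as f:
        pickle.dump(root, f)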
