
XOXO²

[image: XOXO² logo]

Introduction

The project was created during the final project phase of the Data Science Bootcamp at the Spiced Academy in Berlin in November 2023.
The project goal was to use Reinforcement Learning to teach an agent how to play Ultimate Tic-Tac-Toe (U_T-T-T).
In this group project, we first created a Monte Carlo Tree Search (MCTS) algorithm from scratch. Then we used an Artificial Neural Network (ANN) to implement Proximal Policy Optimization (PPO) to improve the performance of our agent.
Additionally, we implemented an interactive interface with Flask and HTML as a final product, where the user can play Ultimate Tic-Tac-Toe against our engine.

Project members:
Lukas Kaufmann, Paul Kotyrba, Nils Schillmann, Sreejith Thamban.
Thanks to Jannes Brunner.

Rules of Ultimate Tic-Tac-Toe

[image: rules of Ultimate Tic-Tac-Toe]

Ultimate Tic-Tac-Toe (U_T-T-T) is played on nine tic-tac-toe boards arranged in a 3 × 3 grid.
Playing on a field inside a board (local game) determines the board in which the opponent must play their next move.
The goal is to win three boards (local games) in a row.
If you are directed to a board that is full or has already been won (and is therefore closed), you may play your next move on any open board.
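
To make this rule concrete, here is a minimal Python sketch of the "next board" logic; the names (allowed_boards, closed_boards) are hypothetical and not taken from our code:

# Boards and fields are both indexed 0..8 in reading order.
def allowed_boards(last_field, closed_boards):
    """Return the set of boards the next player may play in.

    last_field:    index (0-8) of the field the previous move was played on,
                   or None if it is the first move of the game.
    closed_boards: set of board indices that are full or already won.
    """
    open_boards = {b for b in range(9) if b not in closed_boards}
    if last_field is None or last_field in closed_boards:
        # First move, or the targeted board is closed: any open board is allowed.
        return open_boards
    return {last_field}

# Example: the previous move was played on field 4, but board 4 is already closed,
# so the next player may choose any open board.
print(allowed_boards(4, closed_boards={4, 7}))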

[image: draw example]

Interface Example

[animation: interface demo]

About Reinforcement Learning

Reinforcement Learning is used not only in gaming environments, but also in robotics, self-driving cars, and the development of generative AI. We were interested in exploring this topic since it was not part of our bootcamp curriculum and has become increasingly important in recent years.

Model types

[diagram: MCTS and PPO model overview]

As mentioned before, we first created a Monte Carlo Tree Search (MCTS) algorithm from scratch and additionally implemented a memory file of former iterations that the model can load and therefore learn from.
As a second model, we created a Neural Network structure to perform Proximal Policy Optimization (PPO).

MCTS

[diagram: MCTS]

The MCTS works as follows: one iteration consists of selecting a move, expanding the game tree (all nodes in the tree are possible game states or moves that can be played), and simulating the outcome from there by playing random moves. The outcome of this playout is then backpropagated to the top along the path that has been played, and the visit counts and reward values of these nodes are updated. In order to access these results in future runs, we implemented a memory function. This memory file grows large in size quickly, though.
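
To illustrate these four phases, here is a simplified sketch of one MCTS iteration; the Node layout and the game interface (copy, legal_moves, play, winner) are illustrative assumptions, not our actual mcts.py:

import math
import random

class Node:
    def __init__(self, parent=None, move=None):
        self.parent = parent
        self.move = move        # move that led to this node
        self.children = []      # expanded child nodes
        self.visits = 0         # visit count
        self.reward = 0.0       # accumulated reward

    def ucb(self, c=1.41):
        # Unvisited children are explored first; otherwise balance average reward
        # (exploitation) against how rarely the child has been visited (exploration).
        if self.visits == 0:
            return float("inf")
        return self.reward / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def run_iteration(root, game):
    state = game.copy()
    node = root
    # 1. Selection: walk down the tree, always picking the child with the best UCB score.
    while node.children:
        node = max(node.children, key=lambda ch: ch.ucb())
        state.play(node.move)
    # 2. Expansion: add one child per legal move of the reached state.
    for move in state.legal_moves():
        node.children.append(Node(parent=node, move=move))
    # 3. Simulation (rollout): play random moves until the game is over.
    while state.legal_moves():
        state.play(random.choice(state.legal_moves()))
    result = state.winner()     # e.g. 1 (win), 0.5 (draw), 0 (loss)
    # 4. Backpropagation: update visit counts and rewards along the played path.
    # (In a two-player game the result is usually flipped between tree levels;
    # this is omitted here for brevity.)
    while node is not None:
        node.visits += 1
        node.reward += result
        node = node.parent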

PPO

[diagram: PPO]

In the PPO model, our agent performs an action (= move) in the environment (= game) and receives a new game state as well as rewards for its actions. The actor network predicts probabilities for each move, and the critic network estimates the value of the given board state. Compared to the MCTS memory file, the file sizes of the two networks are very small, which makes them much easier to load into the frontend / playable gaming interface.
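
To make the actor/critic split concrete, here is a minimal PyTorch sketch. The layer sizes, the flattened 81-field input, and the class names are placeholders, not our actual network.py:

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim=81, n_actions=81, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        # Move probabilities; illegal moves would be masked out before sampling.
        return torch.softmax(self.net(state), dim=-1)

class Critic(nn.Module):
    def __init__(self, state_dim=81, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        # Estimated value of the given board state.
        return self.net(state)

# Example: one flattened 9x9 board as input.
state = torch.zeros(1, 81)
probs = Actor()(state)            # shape (1, 81): one probability per move
value = Critic()(state)           # shape (1, 1): state value estimate
dist = torch.distributions.Categorical(probs)
action = dist.sample()            # sampled move index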

Model evaluation

[diagram: evaluation round 1]

In order to test the performance of our agents, we let them play U_T-T-T against our baseline model, which performs random moves, as well as against each other.
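
The evaluation itself boils down to a simple playout loop, roughly like the following sketch (the game and agent interfaces shown here are illustrative assumptions, not our actual playout.py):

# Two agents play n games against each other; we count wins, draws, and losses
# from the perspective of agent_a.
def evaluate(agent_a, agent_b, new_game, n_games=100):
    wins = draws = losses = 0
    for _ in range(n_games):
        game = new_game()
        players = {1: agent_a, 2: agent_b}
        while not game.is_game_over():
            agent = players[game.current_player]
            game.perform_action(agent.select_action(game.get_current_state()))
        winner = game.get_winner()   # 1, 2, or None for a draw
        if winner == 1:
            wins += 1
        elif winner is None:
            draws += 1
        else:
            losses += 1
    return wins, draws, losses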

[chart: round 1 results]

Finally, we put our two strongest agents - MCTS with memory and PPO+MCTS - in the ring to play against each other.

[diagram: evaluation round 2]

The final results show that our PPO+MCTS agent has the strongest performance, winning about 60% of the games against MCTS with memory.

[chart: round 2 results]

It is therefore our current strongest agent and the one connected to the interface.

Files and Folders

The following tree-like folder structure diagram provides an overview of our repository:

.
├── data/
│   ├── eda/
│   │   └── ... ->  files for 'model_analysis.ipynb'
│   ├── local/
│   │   └── ... ->  local files, will not be uploaded to repository
│   └── models/
│       ├── mcts/
│       │   └── mcts_memory_small.pkl
│       └── ppo/
│           ├── ppo_v1_actor.pth
│           └── ppo_v1_critic.pth
├── frontend/
│   └── ds-uttt/
│       └── ... ->  interface setup
├── images/
│   └── ... ->  files for 'README.md'
├── logs
├── src/
│   ├── agents/
│   │   ├── network/
│   │   │   └── network.py
│   │   ├── agent.py
│   │   ├── human.py
│   │   ├── mcts.py
│   │   ├── ppo.py
│   │   └── random.py
│   ├── environments/
│   │   ├── game
│   │   └── uttt_env
│   ├── old_files/
│   │   └── ... ->  outdated files for comprehension purposes
│   ├── tests/
│   │   ├── test_game.py
│   │   └── test_mcts.py
│   ├── main_flask.py
│   ├── playout.py
│   ├── test_ppo.py
│   ├── train_mcts.py
│   └── train_ppo.py
├── environment.yml
├── model_analysis.ipynb
└── README.md

Setup Environment

To set up with anaconda / mamba:

conda env create --file environment.yml

To update the environment:

conda env update --file environment.yml

To activate the environment:

conda activate ds-uttt

Setup Interface

TO RUN GAME (backend):

  1. Use the command: "flask --debug --app main_flask run" (Note: make sure you are in the correct folder.)

TO RUN GAME (frontend):

  2. Open a new terminal window and go to ../frontend/ds-uttt.

  3. Use the command: "npm run dev".

  4. Click or copy the URL: "http://localhost:5173/".

  5. The game should now be running in your browser. (Note: if you experience graphical issues, try another browser.)

Contributors

jannesbrunner, kaufmannlukas, paulzbigniew, thamban15

ds-ultimate-tic-tac-toe's Issues

Ideas if anyone has time and nothing to do

Create/Edit README file in Github - explain project / whole process ( = guideline for presentation)


Work on documentation => see documentation task


Run 2 MCTS agents with different value systems (see the sketch after this list):

  • initially: -1 (loss), 0 (draw), 1 (win)
  • improved: 0 (loss), 0.5 (draw), 1 (win)
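
As a reference, the two value systems written out as simple outcome-to-reward mappings (a sketch, not the exact code in mcts.py):

initial_values  = {"loss": -1.0, "draw": 0.0, "win": 1.0}   # original scheme
improved_values = {"loss":  0.0, "draw": 0.5, "win": 1.0}   # readable as winning probabilities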

Create a new function to find the maximum/deepest depth of a memory (deepest path) in mcts.py, similar to count_nodes and count_leaves (see the sketch below)

  • additional functions
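
A recursive sketch of such a max-depth helper, assuming each memory node exposes a list of children (an illustration, not the existing count_nodes/count_leaves code):

def max_depth(node):
    # Depth of a leaf is 1; otherwise 1 plus the deepest subtree below it.
    if not node.children:
        return 1
    return 1 + max(max_depth(child) for child in node.children)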

Create table with all playouts - use EDA notebook

Designing the Neural Network for the RL Agent

This happens after we have defined the base structure for the agent.


NN SETUP

Input representation:
How to represent the game state as input to the neural network?
=> use a CNN to capture spatial relationships within the board

Network architecture
Which architecture to choose for the NN?
i.e. deep neural networks, such as CNN
=> experiment with different network architectures and layer sizes to find the best one
=> our network should also be able to process the game state and produce Q-values for the different actions

Output layer
=> the output layer of the chosen NN should have as many nodes as there are possible actions in the game
=> each node in the output layer corresponds to a different action the agent can take
=> the network should produce Q-values. They represent the expected future rewards for taking each action in the current state
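
A minimal PyTorch sketch of such a Q-network. The board encoding (two 9 x 9 channels for own and opponent stones), the channel counts, and the layer sizes are assumptions for illustration only:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions=81):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, n_actions),   # one Q-value per possible action
        )

    def forward(self, board):
        # board: (batch, 2, 9, 9) -> Q-values: (batch, 81)
        return self.head(self.conv(board))

q_values = QNetwork()(torch.zeros(1, 2, 9, 9))   # shape (1, 81)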

Activation functions
Which activation function to choose for hidden layers?
i.e. ReLU or variants like Leaky ReLU

Loss function
Which loss function to choose?
i.e. MSE loss for Q-learning

Training Procedure
=> Train the NN (using data from self-play)

Exploration vs. Exploitation
Which exploration strategy to implement?
i.e. epsilon-greedy
=> Need to balance exploration (exploring new actions) vs. exploitation (choosing best-known actions)
=> crucial for a robust agent
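
A minimal epsilon-greedy sketch (the Q-values and valid actions here are dummy placeholders):

import random

def epsilon_greedy(q_values, valid_actions, epsilon=0.1):
    # With probability epsilon explore a random valid action,
    # otherwise exploit the action with the highest Q-value.
    if random.random() < epsilon:
        return random.choice(valid_actions)                # explore
    return max(valid_actions, key=lambda a: q_values[a])   # exploit

# Example with dummy Q-values for three valid actions.
print(epsilon_greedy({0: 0.2, 4: 0.7, 8: 0.1}, valid_actions=[0, 4, 8]))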


TUNING

Regularization & Hyperparameters
=> experiment with regularization techniques to prevent overfitting
=> tune hyperparameters

Iterate and Experiment
=> train, evaluate, and adjust the network architecture and parameters

Evaluate and Monitor
=> monitor the agent's performance (does it make progress?)
=> evaluate and test against different opponents or strategies
i.e. let generation n play against generation n-1 or n-2 etc.

Documentation & Clean-Up of Project

Check list of files that are updated/not updated yet:

  • src/
    • agents/
      • agent.py
      • human.py
      • random.py
      • mcts.py
      • ppo.py
      • network/
        • network.py
    • environments/
      • game
      • uttt_env
    • tests/
      • test_game.py
      • test_mcts.py
    • main_flask.py
    • mcts_train.py
    • ppo_train.py
    • main_play.py

Create an overview of what we have done and why and in which order.

Example PPO:

  1. Research done 15th and 16th Nov
  2. Started code implementation 16th Nov
  3. Issues getting PPO running with the example we're using (continuous action space vs. discrete action space => categorical distribution, not multivariate) (16th and 17th)

Example MCTS:

  1. We created the basic UTTT game
  2. We built the MCTS structure
  3. We built the MCTS class and Node class, the two main ingredients of the MCTS code
  4. We let MCTS play against a random agent
  5. Given that MCTS itself is no Reinforcement Learning algorithm (and learning about RL and building RL algorithms is one of our main goals for this project), we explored options to update MCTS into a "kind-of" RL model.
  6. We updated MCTS with a memory function: We are now able to give MCTS_agent_1 a memory from previous MCTS runs (i.e. with 1000 iterations in 100 plays)
  7. MCTS_agent_1 (with memory) played against random_agent (100 wins, 0 draws, 0 losses)
  8. We let MCTS_agent_1 play against MCTS_no_memory (50 wins, 11 draws, 39 losses)
  9. Updated values for wins and losses - before it was -1 (loss), 0 (draw), 1 (win); now it is 0 (loss), 0.5 (draw), 1 (win). This is best practice: it allows us to read the values as winning probabilities and it rewards draws more effectively.
  10. MCTS_agent_2 (with memory, with updated values) played against MCTS_no_mem (40 wins, 26 draws, 34 losses)
  11. MCTS_agent_2 (with once updated memory, with updated values) played against MCTS_no_mem (48 wins, 24 draws, 28 losses)
  12. MCTS_agent_2 (with twice updated memory, with updated values) played against MCTS_no_mem (51 wins, 23 draws, 26 losses). On top of that, MCTS_agent_2 finally plays (5,5) as its starting move - and the first time we played against it ourselves, we were forced to save ourselves with a draw. Done with MCTS on the 15th.

Example model selection: We researched the following models and algorithms: ...

  • We selected the following models: MCTS, PPO, and as a possible 3rd option (if time): a combination of MCTS with AlphaZero
  • Why did we choose these models? MCTS: ... | PPO: ... | possibly MCTS + AlphaZero
  • What are the differences of our models? Sheet

Meeting with her (16th)

Questions:

How could we approach the INTERFACE?

  • frontend advice: what tools to use?
  • Does she know Render? If so, does she recommend it?

Did she work with PPO?

  • What should our game environment look like?

Code Monte Carlo Tree Search & implement it

Here's a high-level overview of how you can implement MCTS for UTTT:


Define the Game Rules:

Start by defining the rules of UTTT. Understand the game mechanics, legal moves, and how the game state transitions from one position to another.

Create a UTTT Simulator:

Build a UTTT simulator that can represent the game state and allow you to make moves, check for wins, and determine valid moves.


MCTS Components:

Implement the core components of MCTS, which include the following:

  1. Node Structure: Create a node structure that represents each state in the search tree. Each node should store information such as the number of visits, the total reward, and the possible actions.
  2. Selection: Develop a selection strategy (e.g., Upper Confidence Bound) to choose nodes in the tree to explore further.
  3. Expansion: When a selected node has unexplored actions, expand the tree by creating child nodes for those actions.
  4. Simulation (Rollout): Simulate random playouts from a node to estimate the value of unexplored states. The rollout policy can be random or based on heuristics.
  5. Backpropagation: Update the statistics of nodes as you backpropagate the results of rollouts to their parent nodes.

UCT Algorithm:

Consider using the Upper Confidence Bound for Trees (UCT) algorithm, which is a widely used selection strategy within MCTS. UCT balances exploration and exploitation.
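
The UCT score for a child node is usually the UCB1 formula; a short sketch (names are generic, not tied to a specific implementation):

import math

def ucb1(child_reward, child_visits, parent_visits, c=math.sqrt(2)):
    # First term favors exploitation (high average reward), second term favors
    # exploration (rarely visited children); c is the exploration constant.
    if child_visits == 0:
        return float("inf")   # unvisited children are tried first
    return child_reward / child_visits + c * math.sqrt(math.log(parent_visits) / child_visits)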

Search and Decision-Making:

Create a search loop that repeatedly selects nodes to expand, simulate, and update based on MCTS until a stopping criterion (e.g., time limit or a maximum number of iterations) is reached.


Integration with UTTT Environment:

Integrate your MCTS implementation with the UTTT environment, allowing it to interact with the game and make decisions based on the search results.


Tuning Parameters:

Experiment with the parameters of the MCTS algorithm, such as exploration constant, to fine-tune its performance.

Parallelization (Optional):

If computational resources allow, consider parallelizing MCTS to speed up the search process.


MCTS Variants:

Explore advanced MCTS variants like Monte Carlo Tree Search with Upper Confidence Bounds applied to Trees (MCT-UCT) or other enhancements that may improve performance.


Testing and Evaluation:

Test your MCTS-based UTTT AI against various opponents or strategies to evaluate its performance and iteratively refine your implementation.

Optimization (Optional):

If your MCTS implementation is running too slowly, consider optimization techniques to make the search process more efficient.

Debugging and Profiling:

Use debugging tools and profiling to identify and address issues in your implementation.


LINKS

Good GitHub Repos for coding MCTS:

MCTS install via pip:

Not too helpful Repos:

Define Goals of capstone

  • Learn about Reinforcement Learning
  • How do Rewards work? How does an Agent learn?
  • The final AI should be able to beat most human players
  • Having a playable (Web-) Interface with nice UI

QUESTIONS

do we want a self-play algorithm?

Understand Object-oriented programming

  • class objects
  • class inside another class => Game._Player()
  • @property => built-in decorator for a function => "decorates the function"; i.e. the current_player function can be treated as a variable, not as a function (see the example below)
  • | = or
  • finished_games => the second part of the code in that line creates a two-dimensional array out of the one-dimensional won/not-won array
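
A tiny example of the @property decorator, with hypothetical names, just to illustrate the note above:

class Game:
    def __init__(self):
        self._move_count = 0

    @property
    def current_player(self):
        # Derived from the move count instead of being stored separately.
        return 1 if self._move_count % 2 == 0 else 2

game = Game()
print(game.current_player)   # accessed like a variable, not called as game.current_player()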

Define Code Structure

For more advanced AI algorithms, it's often better to keep the game environment and the AI agent separate. This separation allows for more flexibility and easier integration of different agents.

Here's a recommended structure for the implementation, including:

  • game environment
  • agent
  • policy
  • interactions

Game Environment (UTTT):

Create a UTTT class that encapsulates the game rules, state representation, and methods for interacting with the game. This class should include functions to:

  • Initialize the game state.
  • Determine valid actions.
  • Update the game state based on selected actions.
  • Check for game termination and outcomes.
class UTTT:
    def __init__(self):
        # Initialize game state, rules, and attributes
        ...

    def get_current_state(self):
        # Return the current game state
        ...

    def get_valid_actions(self):
        # Return a list of valid actions in the current state
        ...

    def perform_action(self, action):
        # Update the game state based on the selected action
        ...

    def is_game_over(self):
        # Check if the game is over and determine the outcome
        ...


Agent:

Create an agent class that can interact with the game environment to make decisions. You can implement different agents, including the random baseline agent, MCTS-based agent, or more advanced AI models.

class RandomAgent:
    def __init__(self, num_actions):
        # Initialize the agent with relevant parameters
        ...

    def select_action(self, state):
        # Implement action selection logic (e.g., random choice)
        ...


Policy:

If you plan to implement more advanced algorithms, such as MCTS or deep reinforcement learning, you might use a policy class to represent the agent's strategy.

class Policy:
    def __init__(self, model):
        # Initialize the policy with a model (e.g., neural network)
        ...

    def select_action(self, state):
        # Implement action selection logic based on the policy model
        ...


Interactions and Game Loop:

Finally, in your main script or game loop, you can set up the game environment, instantiate the agent, and manage interactions. The game loop should involve the following steps:

# Initialize the game environment
game = UTTT()

# Initialize the agent
agent = RandomAgent(game.get_num_actions())  # Or use a more advanced agent if desired

while not game.is_game_over():
    # Get the current state from the game
    state = game.get_current_state()

    # The agent selects an action based on the state
    action = agent.select_action(state)

    # Update the game state based on the action
    game.perform_action(action)

# Determine the outcome and handle it accordingly
outcome = game.get_game_outcome()


Implement "Memory" for MCTS

Status quo:
We run MCTS and let it "learn", but for the next run of MCTS we start again from zero.

Idea:
Use the previous iterations and memorize (down to a certain depth of moves) the previous playouts, values, and nodes.


Approach:

  1. Implement a "single-time" memory of 1 previous iteration => for now, only 1 memory is given and the MCTS run uses and updates it for one additional iteration => the values of 1 run are stored for 1 additional run
  2. Implement a loop with n iterations of MCTS runs, so that we update the "memory" n times with the values of the previous runs => all previous run values stored for n additional runs (see the persistence sketch below)
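
A minimal sketch of persisting such a memory with pickle. The file path and the stored tree object are placeholders; the actual memory format in the repository may differ:

import os
import pickle

MEMORY_FILE = "data/models/mcts/mcts_memory.pkl"   # placeholder path

def load_memory():
    if os.path.exists(MEMORY_FILE):
        with open(MEMORY_FILE, "rb") as f:
            return pickle.load(f)   # previously stored root node / tree
    return None                     # no memory yet: start from scratch

def save_memory(root):
    with open(MEMORY_FILE, "wb") as f:
        pickle.dump(root, f)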
