mpi_pytorch's Introduction

Scalable Analytics Project

Author: Erick Escobar Gallardo

Email: [email protected]

Date: 23/06/21

The project consists of implementing an image classifier using deep neural networks (CNNs).

Installation

Use the package manager pip. To install all the requirements, execute the following command:

pip install -r requirements.txt

Usage

To change the training and validation constants or the directories, edit the utils.py file. Running the project can be divided into the following phases:

  1. Modify the directory paths in utils.py as needed.
  2. Run create_dataset.py to create a smaller sample of the dataset and store it in ./data. This script splits the sample into a training part and a testing part, each with its own dataframe.
  3. Modify the training settings in utils.py to set up the CNN architecture, the number of epochs, etc.
  4. Submit training_job.sh with the command qsub training_job.sh to start training the CNN model. The training job creates a checkpoint inside the folder ./checkpoints.
  5. Submit evaluation_job.sh with the command qsub evaluation_job.sh to run the evaluation pipeline.

The program can also be launched directly with mpiexec, for example with 2 processes:

mpiexec -n 2 python -m mpi4py main.py

checkpoints: C:\Users\erick.cache\torch\hub\checkpoints

Development

Task 1: A simple neural network

We used several predefined PyTorch computer vision architectures, including resnet18, resnet34, alexnet, vgg, squeezenet, densenet, and inception.

PyTorch's internal parallelism is disabled using 'torch.set_num_threads(1)'. For this task, a well-structured training model is defined. To reduce training time, set the constant DEBUG to True; the selected CNN architecture is then trained on a sample of the original training dataset.
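
The DEBUG behaviour described above can be sketched as follows. This is a minimal illustration, not the project's code: the function name select_training_samples, the constant DEBUG_FRACTION, and the seed parameter are hypothetical, and the sketch uses only the standard library rather than PyTorch.

```python
import random

DEBUG = True           # hypothetical flag mirroring the DEBUG constant in utils.py
DEBUG_FRACTION = 0.1   # hypothetical: share of the training set kept in debug mode


def select_training_samples(samples, debug=DEBUG, fraction=DEBUG_FRACTION, seed=0):
    """Return all samples, or a reproducible random subset when debug is on."""
    samples = list(samples)
    if not debug or not samples:
        return samples
    # Keep at least one sample, and never more than are available.
    keep = min(len(samples), max(1, int(len(samples) * fraction)))
    rng = random.Random(seed)  # fixed seed so debug runs are repeatable
    return rng.sample(samples, keep)


if __name__ == "__main__":
    # In the project itself, torch.set_num_threads(1) would be called before
    # training so that PyTorch does not use its own intra-op thread pool.
    image_ids = [f"img_{i:04d}.jpg" for i in range(1000)]
    subset = select_training_samples(image_ids)
    print(f"training on {len(subset)} of {len(image_ids)} images")
```

A fixed seed keeps the debug subset identical between runs, which makes short experiments comparable.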

Task 2: MPI parallelism

To distribute the training process, we first scatter the dataset to all the nodes using MPI.Scatter. The dataset is split equally among all the processing nodes.
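
As an illustration of the equal split, the sketch below builds one chunk per process. The helper name split_for_scatter is hypothetical, and how the project handles a dataset size that is not divisible by the number of processes is not stated; here the first chunks simply receive one extra item.

```python
def split_for_scatter(items, n_procs):
    """Split items into n_procs contiguous chunks whose sizes differ by at most one."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    base, extra = divmod(len(items), n_procs)
    chunks, start = [], 0
    for rank in range(n_procs):
        # The first `extra` ranks take one additional item each.
        size = base + (1 if rank < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# With mpi4py, the root process would typically pass this list to scatter
# (shown as a comment so the sketch runs without an MPI installation):
#   comm = MPI.COMM_WORLD
#   chunks = split_for_scatter(rows, comm.Get_size()) if comm.Get_rank() == 0 else None
#   local_rows = comm.scatter(chunks, root=0)

if __name__ == "__main__":
    for rank, chunk in enumerate(split_for_scatter(list(range(10)), 4)):
        print(rank, chunk)
```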

Distributed training uses MPI Allreduce, which applies a SUM operation to the gradients of all processes. Each process then averages the sum by dividing it by the total number of processes.
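
The sum-then-average step can be simulated without MPI, as in the sketch below. Both function names are hypothetical, and each gradient is represented as a plain list of floats rather than a tensor.

```python
def allreduce_sum(per_rank_grads):
    """Simulate Allreduce with SUM: every rank receives the element-wise sum."""
    total = [sum(values) for values in zip(*per_rank_grads)]
    return [list(total) for _ in per_rank_grads]


def average_gradients(per_rank_grads):
    """Sum gradients across ranks, then divide by the number of processes."""
    n_procs = len(per_rank_grads)
    summed = allreduce_sum(per_rank_grads)
    return [[g / n_procs for g in grads] for grads in summed]


# With mpi4py and a NumPy gradient buffer, the same step would look roughly like:
#   comm.Allreduce(MPI.IN_PLACE, grad_buffer, op=MPI.SUM)
#   grad_buffer /= comm.Get_size()

if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, 6.0]]   # one gradient vector per simulated rank
    print(average_gradients(grads))
```

After this step every process holds the same averaged gradient, so all model replicas apply an identical update.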

Task 3: Pipelining

For the testing procedure, we use a simple pipeline with four stages: reading an image, resizing it, preprocessing (normalizing) it, and feeding the image tensor to the model. The pipeline takes the total number of processes into account: the first 3 processes handle the first 3 tasks, and the remaining processes are in charge of model prediction.
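
The mapping from process rank to pipeline stage can be sketched as below. The names STAGES, stage_for_rank, and predictor_for_image are hypothetical, and the round-robin assignment of images to prediction processes is an assumption; the project may distribute the work differently.

```python
STAGES = ("read", "resize", "preprocess")  # one dedicated process per stage


def stage_for_rank(rank, size):
    """Return the pipeline stage handled by the process with the given rank."""
    if size < len(STAGES) + 1:
        raise ValueError("the pipeline needs at least 4 processes")
    if not 0 <= rank < size:
        raise ValueError("rank must be in the range [0, size)")
    return STAGES[rank] if rank < len(STAGES) else "predict"


def predictor_for_image(image_index, size):
    """Pick a prediction rank for an image in round-robin order (an assumption)."""
    n_predictors = size - len(STAGES)
    if n_predictors < 1:
        raise ValueError("the pipeline needs at least 4 processes")
    return len(STAGES) + image_index % n_predictors


if __name__ == "__main__":
    size = 5  # e.g. mpiexec -n 5
    for rank in range(size):
        print(rank, stage_for_rank(rank, size))
```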

IMPORTANT: Remember to first start the training process for an architecture in order to create a checkpoint that will be used for the pipeline evaluation process.

Task 5: Deep Learning

For this task we used 2 different CNN architectures, each trained for 10 epochs.

Model name    Validation score    Testing score
Resnet 34     0.626262            0.1123
Resnet 18     0.8862              0.1962

Note that the testing score is lower because training was done on a smaller part of the original training dataset in order to shorten the training time. We could only test two models due to the 6 GB storage restriction imposed by hydra; other CNN architectures require more storage space.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

Link to GitHub: https://github.com/erick093/MPI_Pytorch

License

MIT
