mrfahrenhiet / explainable_image_caption_bot

Explainable Image Caption Bot: generates captions for images along with an explanation for each word in the caption (the part of the image the model looks at to generate that word). Uses a sequence-to-sequence model with attention.

Home Page: https://share.streamlit.io/mrfahrenhiet/explainable_image_caption_bot/app_streamlit.py

Python 100.00%
image-caption-bot computer-vision multi-modal nlp lstm resnet-101 attention-mechanism pytorch image-captioning attention

explainable_image_caption_bot's Introduction

Explainable Image Captioning

The goal of image captioning is to convert a given input image into a natural language description. In this project, apart from generating captions, we also show the regions of the image the model looks at while generating a particular word, using an attention mechanism.

In this project I have implemented the Show, Attend and Tell paper using the PyTorch deep learning library.

New Streamlit App

This project is now deployed as a Streamlit app. Click here to view it.

Instructions to Use:

git clone https://github.com/mrFahrenhiet/Explainable_Image_Caption_bot.git
cd Explainable_Image_Caption_bot
pip install -r requirements.txt
# Note: the model weights are downloaded automatically.
# If you have already downloaded them, place them in the expected path
# and update the path in app_streamlit.py accordingly.

streamlit run app_streamlit.py  # Run the app on your local machine

App Screenshots

Overview

In this project I have implemented the Show, Attend and Tell paper. This is not the current state of the art, but it is still impressive.

This model learns where to look.

As you generate a caption, word by word, you can see the model's gaze shifting across the image.

This is possible because of its Attention mechanism, which allows it to focus on the part of the image most relevant to the word it is going to utter next.
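Below is a minimal sketch, assuming one 7x7 attention map per generated word (helper and variable names are illustrative, not the repository's exact code), of how the attention weights can be upsampled and overlaid on the image to produce these visual explanations:

# Illustrative sketch: overlay one 7x7 attention map per generated word on the image.
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

def show_attention(image_path, words, alphas):
    # words: generated caption tokens; alphas: list of (7, 7) numpy attention maps
    image = Image.open(image_path).resize((224, 224))
    cols = 5
    rows = int(np.ceil(len(words) / cols))
    for i, (word, alpha) in enumerate(zip(words, alphas)):
        plt.subplot(rows, cols, i + 1)
        plt.imshow(image)
        # Upsample the coarse attention map to the image size and blend it on top.
        heatmap = Image.fromarray((alpha / alpha.max() * 255).astype(np.uint8)).resize((224, 224))
        plt.imshow(heatmap, cmap="gray", alpha=0.6)
        plt.title(word)
        plt.axis("off")
    plt.tight_layout()
    plt.show()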

Here are some captions generated on test images not seen during training or validation:

Model Description

The model basically has three parts (a minimal sketch of how they fit together follows this list):

  • Encoder: the image is encoded into a 49 x 2048 feature map using a pretrained ResNet-50 architecture.
  • Attention: the attention module uses the 49 encoded regions of the image to compute a context vector, which is passed on to the decoder.
  • Decoder: at each step, the LSTM cell in the decoder receives the context vector and, given that and the previously predicted word, predicts the next word of the caption.

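The following is a minimal sketch of how these three pieces interact for a single decoding step (dimensions follow the description above; layer names, vocabulary size and embedding size are assumptions, not the repository's exact code):

# Sketch of encoder -> attention -> one LSTM decoder step (assumed names/sizes).
import torch
import torch.nn as nn
import torchvision.models as models

# Encoder: pretrained ResNet-50 without its pooling/classification head,
# so a 224x224 image becomes a (2048, 7, 7) feature map, i.e. 49 x 2048.
resnet = models.resnet50(pretrained=True)
encoder = nn.Sequential(*list(resnet.children())[:-2])

# Soft attention: scores the 49 encoded regions against the decoder state.
class Attention(nn.Module):
    def __init__(self, enc_dim=2048, dec_dim=512, att_dim=512):
        super().__init__()
        self.enc_att = nn.Linear(enc_dim, att_dim)
        self.dec_att = nn.Linear(dec_dim, att_dim)
        self.full_att = nn.Linear(att_dim, 1)

    def forward(self, enc_out, dec_hidden):
        # enc_out: (B, 49, 2048), dec_hidden: (B, 512)
        att = self.full_att(torch.tanh(self.enc_att(enc_out) + self.dec_att(dec_hidden).unsqueeze(1)))
        alpha = torch.softmax(att.squeeze(2), dim=1)          # (B, 49) attention weights
        context = (enc_out * alpha.unsqueeze(2)).sum(dim=1)   # (B, 2048) context vector
        return context, alpha

# Decoder step: previous word embedding + context vector -> next-word scores.
vocab_size, embed_dim, hidden_dim = 5000, 300, 512            # assumed sizes
embed = nn.Embedding(vocab_size, embed_dim)
lstm_cell = nn.LSTMCell(embed_dim + 2048, hidden_dim)
fc = nn.Linear(hidden_dim, vocab_size)

image = torch.randn(1, 3, 224, 224)
enc_out = encoder(image).flatten(2).permute(0, 2, 1)          # (1, 49, 2048)
h, c = torch.zeros(1, hidden_dim), torch.zeros(1, hidden_dim)
prev_word = torch.tensor([1])                                 # e.g. the <start> token id
context, alpha = Attention()(enc_out, h)
h, c = lstm_cell(torch.cat([embed(prev_word), context], dim=1), (h, c))
scores = fc(h)                                                # logits over the vocabulary

The alpha weights produced at each step, reshaped to 7x7, are what get overlaid on the image in the visualization sketch above.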

Credits: Show, Attend and Tell paper

Dataset Description

For this implementation I used the Flickr8k dataset, which contains a total of 8091 images, each with 5 captions. 7511 images were used for training and 580 for validation. The Results section gives a brief overview of the results obtained and the metrics used for evaluation.
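As a rough illustration (assuming the standard Flickr8k.token.txt caption file, not the repository's exact loader), the 5 captions belonging to each image can be grouped like this:

# Sketch: group the 5 captions belonging to each Flickr8k image
# (assumes the standard "image.jpg#idx<TAB>caption" format of Flickr8k.token.txt).
from collections import defaultdict

captions = defaultdict(list)
with open("Flickr8k.token.txt") as f:
    for line in f:
        image_id, caption = line.rstrip("\n").split("\t")
        image_name = image_id.split("#")[0]    # drop the "#0".."#4" caption index
        captions[image_name].append(caption.lower())

print(len(captions))                           # ~8091 images, 5 captions each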

Results

The metrics used for evaluation were cross-entropy loss and BLEU score. The loss was used for both training and validation. I calculated the BLEU score over 580 images with 5 captions each. The model was trained for 50 epochs, and the best result was achieved at the 45th epoch. The graph below shows how the training and validation losses decreased over the course of training.

Losses graph (training and validation loss over the course of training)

The table below shows the BLEU scores our model obtained during the testing phase. The BLEU score implementation can be found in this Jupyter notebook; a short illustrative sketch follows the table.

Flickr8K model

Torch metric Score
BLEU-1 0.53515
BLEU-2 0.35395
BLEU-3 0.23636
BLEU-4 0.15749
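As a rough sketch of how such corpus-level scores can be computed with torchmetrics (illustrative captions and assumed usage; the actual implementation is in the linked notebook):

# Sketch: BLEU-1..BLEU-4 over generated captions with 5 references per image
# (illustrative data; assumed torchmetrics usage, see the linked notebook for the real code).
from torchmetrics.text import BLEUScore

preds = ["a dog is running through the grass"]          # one generated caption per image
targets = [[                                            # its 5 reference captions
    "a dog runs through the green grass",
    "a brown dog is running in a field",
    "a dog running outside",
    "the dog runs across the grass",
    "a dog plays in the grass",
]]

for n in range(1, 5):
    bleu = BLEUScore(n_gram=n)
    print(f"BLEU-{n}:", round(bleu(preds, targets).item(), 5))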

Colab Notebooks and Datasets

The Colab notebooks used for training and the datasets used in the development of this project can be found here.

Pretrained Model

If you do not want to train the model from scratch, you can use a pretrained model. You can download the pretrained model and vocab state_checkpoint file from here

Future Work

  • Implementing beam search at test time
  • Implementing the whole network architecture using transformers
  • Using the Flickr30k and MS COCO datasets for training and evaluation

explainable_image_caption_bot's People

Contributors: dependabot[bot], mrfahrenhiet
