
Image-Captioning

In this project, a neural network architecture is built to automatically generate captions from images.

After training the network on the Microsoft Common Objects in Context (MS COCO) dataset, it is used to generate captions for new images.

Project Files

The project includes the following files:

  • model: contains the model architecture.
  • training: data pre-processing and the training pipeline.
  • inference: generates captions on the test dataset using the trained model.

Understanding LSTMs

A Long Short-Term Memory (LSTM) network is a sequential architecture designed to solve long-term dependency problems. Remembering information for long periods of time is practically its default behavior.

LSTM architecture

To achieve this long-term behavior, an LSTM uses four stages/gates:

  • Forget gate
  • Learn gate
  • Remember gate
  • Use gate

The learn gate combines the current event with the parts of short-term memory that are not ignored by the pass-through factor. Mathematically, the expression is the following:

Learn gate

where i is the ignoring factor, given by a sigmoid output between 0 and 1.

The forget gate takes the long-term memory and forgets part of it, producing a new memory.

Forget gate

The remember gate combines the outputs of the forget and learn gates, generating a new long-term memory.

Remember gate

Finally, we need to decide what we’re going to output, i.e., use gate aka new short-term memory. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

Use gate
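The four gates above map directly onto the standard LSTM cell update. A minimal NumPy sketch of a single cell step (names and sizes here are illustrative, not taken from the project code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, stm_prev, ltm_prev, W, b):
    """One LSTM step. W maps [x_t, stm_prev] to the four gate pre-activations."""
    z = np.concatenate([x_t, stm_prev]) @ W + b   # (4 * hidden,)
    f_t, i_t, o_t, n_t = np.split(z, 4)
    f_t = sigmoid(f_t)       # forget gate: what to drop from long-term memory
    i_t = sigmoid(i_t)       # ignoring factor used by the learn gate
    o_t = sigmoid(o_t)       # use gate: which parts of the cell state to output
    n_t = np.tanh(n_t)       # candidate new information (learn gate)
    ltm_t = f_t * ltm_prev + i_t * n_t   # remember gate: new long-term memory
    stm_t = o_t * np.tanh(ltm_t)         # use gate: new short-term memory
    return stm_t, ltm_t

# toy dimensions
embed, hidden = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(embed + hidden, 4 * hidden))
b = np.zeros(4 * hidden)
stm, ltm = np.zeros(hidden), np.zeros(hidden)
stm, ltm = lstm_step(rng.normal(size=embed), stm, ltm, W, b)
print(stm.shape, ltm.shape)  # (4,) (4,)
```

Note how the remember gate is just the sum of the forget-gate and learn-gate outputs, and the short-term memory is a filtered (sigmoid-gated) tanh of the new long-term memory, exactly as described above.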

Methodology

For the representation of images, a Convolutional Neural Network (CNN) was used. CNNs have been widely used and studied for image tasks, and are currently state of the art for object recognition and detection. As the particular choice of CNN architecture, ResNet was used due to its performance on object classification on ImageNet.

Regarding the decoder, the LSTM was chosen as the sequence generator because of its ability to deal with vanishing and exploding gradients, the most common challenge in designing and training RNNs. The following parameters were chosen for the LSTM architecture:

  • learning rate: 0.001
  • hidden size: 512
  • embed size: 512
  • number of LSTM cells: 1
  • batch size: 32

The embed and hidden sizes (512) were selected based on this paper. In addition, dropout was used to avoid overfitting. One LSTM layer was used, following the previously mentioned paper, but with a larger hidden size to provide a "larger memory". As a next step, a two-layer LSTM could be used.
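With those hyperparameters, the decoder can be sketched like this (the 0.5 dropout rate and the exact teacher-forcing layout are assumptions for illustration):

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    """Single-layer LSTM decoder using the hyperparameters listed above."""
    def __init__(self, vocab_size, embed_size=512, hidden_size=512, num_layers=1):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.dropout = nn.Dropout(0.5)   # dropout rate is an assumption
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # the image features act as the first "word"; drop the last caption token
        embeddings = self.word_embed(captions[:, :-1])              # (B, T-1, E)
        inputs = torch.cat([features.unsqueeze(1), embeddings], 1)  # (B, T, E)
        hidden, _ = self.lstm(inputs)
        return self.fc(self.dropout(hidden))                        # (B, T, V)

decoder = DecoderRNN(vocab_size=1000)
out = decoder(torch.randn(2, 512), torch.randint(0, 1000, (2, 12)))
print(out.shape)  # torch.Size([2, 12, 1000])
```

Because embed size equals hidden size, the image feature vector can be fed directly as the first LSTM input without an extra projection.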

Regarding the optimizer, Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSProp; however, SGD with Nesterov momentum is often worth trying as an alternative. The full Adam update also includes a bias-correction mechanism, which compensates for the fact that in the first few time steps the vectors m and v are both initialized at zero and therefore biased, before they fully "warm up" (based on this reference).
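The bias-correction terms are easiest to see in a bare-bones NumPy version of one Adam step (this is a sketch of the algorithm, not the project's training code):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m_hat and v_hat undo the zero-initialization bias."""
    m = beta1 * m + (1 - beta1) * grad      # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad**2   # second-moment estimate
    m_hat = m / (1 - beta1**t)              # bias correction, large effect for small t
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# three steps on f(w) = 0.5 * ||w||^2, whose gradient is w itself
w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
for t in range(1, 4):
    w, m, v = adam_step(w, w, m, v, t)
print(w)
```

Without the `1 - beta**t` denominators, the first updates would be scaled down by roughly `1 - beta1` and `1 - beta2`, which is exactly the zero-initialization bias the text refers to.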

Finally, for inference a greedy algorithm was used, which at each step selects the word with the maximum probability in the output distribution and appends it to the sequence.
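Greedy decoding simply takes the argmax at every step and feeds the chosen word back as the next input. A toy sketch (the fixed transition table here stands in for the trained decoder):

```python
import numpy as np

def greedy_decode(step_fn, start_token, end_token, max_len=20):
    """Repeatedly pick the most likely next word until <end> or max_len."""
    tokens = [start_token]
    for _ in range(max_len):
        probs = step_fn(tokens[-1])
        nxt = int(np.argmax(probs))   # greedy: keep only the most likely word
        tokens.append(nxt)
        if nxt == end_token:
            break
    return tokens

# toy "decoder": next-word distributions over a 4-word vocabulary
# 0 = <start>, 1 = "a", 2 = "dog", 3 = <end>
P = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.0, 0.0, 0.8, 0.2],
              [0.0, 0.1, 0.0, 0.9],
              [0.0, 0.0, 0.0, 1.0]])
print(greedy_decode(lambda t: P[t], start_token=0, end_token=3))  # [0, 1, 2, 3]
```

The drawback is that an early greedy choice can lock the decoder out of a globally better caption, which is what beam search (listed in the TODOs) addresses.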

General architecture

Dataset

  1. Clone this repo: https://github.com/cocodataset/cocoapi
git clone https://github.com/cocodataset/cocoapi.git  
  2. Set up the COCO API (also described in the readme here)
cd cocoapi/PythonAPI  
make  
cd ..
  3. Download some specific data from here: http://cocodataset.org/#download (described below)
  • Under Annotations, download:

    • 2014 Train/Val annotations [241MB] (extract captions_train2014.json and captions_val2014.json, and place at locations cocoapi/annotations/captions_train2014.json and cocoapi/annotations/captions_val2014.json, respectively)
    • 2014 Testing Image info [1MB] (extract image_info_test2014.json and place at location cocoapi/annotations/image_info_test2014.json)
  • Under Images, download:

    • 2014 Train images [83K/13GB] (extract the train2014 folder and place at location cocoapi/images/train2014/)
    • 2014 Val images [41K/6GB] (extract the val2014 folder and place at location cocoapi/images/val2014/)
    • 2014 Test images [41K/6GB] (extract the test2014 folder and place at location cocoapi/images/test2014/)

References

TODO

  • Use the validation set to guide the search for appropriate hyperparameters.
  • Implement the BLEU score metric.
  • Implement beam search to generate captions on new images.
  • Add attention to the model to reproduce research-paper results.
  • Use YOLO for object detection.
  • Use an attention model in text generation.
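For the beam-search TODO, the idea is to keep the k highest-scoring partial captions (by summed log-probability) at every step instead of a single greedy choice. A minimal sketch, using a toy transition table in place of the trained decoder:

```python
import numpy as np

def beam_search(step_fn, start_token, end_token, beam_width=2, max_len=10):
    """Keep the beam_width highest-scoring partial captions at every step."""
    beams = [([start_token], 0.0)]   # (tokens, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_token:        # finished hypotheses carry over
                candidates.append((tokens, score))
                continue
            log_probs = np.log(step_fn(tokens[-1]) + 1e-12)
            for w in np.argsort(log_probs)[-beam_width:]:
                candidates.append((tokens + [int(w)], score + log_probs[w]))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(t[-1] == end_token for t, _ in beams):
            break
    return beams[0][0]

# same toy vocabulary as before: 0 = <start>, 1 = "a", 2 = "dog", 3 = <end>
P = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.0, 0.0, 0.8, 0.2],
              [0.0, 0.1, 0.0, 0.9],
              [0.0, 0.0, 0.0, 1.0]])
print(beam_search(lambda t: P[t], start_token=0, end_token=3))  # [0, 1, 2, 3]
```

With `beam_width=1` this reduces to the greedy decoder; larger widths trade compute for better captions.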
