visual_question_answering

Introduction

This neural system for visual question answering is loosely based on the paper "Dynamic Memory Networks for Visual and Textual Question Answering" by Xiong et al. (ICML 2016). The input is an image and a question about the image, and the output is a one-word answer to the question. A convolutional neural network extracts visual features from the image, and a bidirectional GRU recurrent neural network fuses these features. The question is encoded with either a GRU recurrent neural network or a positional encoding scheme. A dynamic memory network with an attention mechanism then generates the answer from the fused visual features and the encoded question. The project is implemented with the TensorFlow library and supports end-to-end training of both the CNN and RNN parts.
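
The positional encoding option for the question can be illustrated with a small sketch. It assumes the scheme from the end-to-end memory network literature that the cited paper builds on: with 1-based word position j of J and embedding dimension k of d, the weight is l_kj = (1 - j/J) - (k/d)(1 - 2j/J), and the question vector is the weighted sum of the word embeddings. The function names and the toy embeddings are hypothetical and are not taken from this project's code, which may differ in detail.

```python
def positional_weights(num_words, dim):
    """Return weights[j][k] for word j and embedding dimension k.

    Assumed formula (1-based j and k):
        l_kj = (1 - j/J) - (k/d) * (1 - 2j/J)
    """
    weights = []
    for j in range(1, num_words + 1):
        row = [
            (1 - j / num_words) - (k / dim) * (1 - 2 * j / num_words)
            for k in range(1, dim + 1)
        ]
        weights.append(row)
    return weights


def encode_question(word_embeddings):
    """Sum the word embeddings after element-wise positional weighting."""
    num_words = len(word_embeddings)
    dim = len(word_embeddings[0])
    weights = positional_weights(num_words, dim)
    encoded = [0.0] * dim
    for weight_row, embedding in zip(weights, word_embeddings):
        for k in range(dim):
            encoded[k] += weight_row[k] * embedding[k]
    return encoded


if __name__ == "__main__":
    # Hypothetical two-word question with two-dimensional embeddings.
    toy_embeddings = [[1.0, 1.0], [1.0, 1.0]]
    print(encode_question(toy_embeddings))
```

Unlike a plain bag-of-words sum, this weighting makes the encoding depend on word order while staying cheaper than running a recurrent encoder over the question.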

Prerequisites

Usage

  • Preparation: Download the COCO train2014 and val2014 images here. Put the COCO train2014 images in the folder train/images, and put the COCO val2014 images in the folder val/images. Download the VQA v1 training and validation questions and annotations here. Put the files mscoco_train2014_annotations.json and OpenEnded_mscoco_train2014_questions.json in the folder train. Similarly, put the files mscoco_val2014_annotations.json and OpenEnded_mscoco_val2014_questions.json in the folder val. Finally, if you want to initialize the CNN part from pretrained weights, download the pretrained VGG16 net here or the ResNet50 net here.

  • Training: To train a model on the VQA v1 training data, first set up the parameters in the file config.py, then run a command like this:

python main.py --phase=train \
    --load_cnn \
    --cnn_model_file='./vgg16_no_fc.npy' \
    [--train_cnn]

Turn on --train_cnn if you want to jointly train the CNN and RNN parts. Otherwise, only the RNN part is trained. The checkpoints will be saved in the folder models. If you want to resume the training from a checkpoint, run a command like this:

python main.py --phase=train \
    --load \
    --model_file='./models/xxxxxx.npy' \
    [--train_cnn]

To monitor the progress of training, run the following command:

tensorboard --logdir='./summary/'

  • Evaluation: To evaluate a trained model using the VQA v1 validation data, run a command like this:

python main.py --phase=eval --model_file='./models/xxxxxx.npy'

The result will be shown in stdout. Furthermore, the generated answers will be saved in the file val/results.json.

  • Inference: You can use the trained model to answer questions about arbitrary JPEG images. Put the images in the folder test/images. Create a CSV file containing your questions, with three fields (image, question, question_id), and put it in the folder test. Then run a command like this:

python main.py --phase=test --model_file='./models/xxxxxx.npy'

The generated answers will be saved in the folder test/results.
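
The preparation and inference steps above can be sketched as two small helpers: one reports which of the expected training and validation inputs are still missing, and one writes a questions CSV with the three required fields. This is a minimal sketch under stated assumptions. The helper names, the CSV file name questions.csv, the presence of a header row, the use of bare image file names in the image column, and the sequential question_id values are all hypothetical; check the project's own loader before relying on them.

```python
import csv
from pathlib import Path

# Paths named in the preparation step, relative to the project root.
EXPECTED_INPUTS = [
    "train/images",
    "train/mscoco_train2014_annotations.json",
    "train/OpenEnded_mscoco_train2014_questions.json",
    "val/images",
    "val/mscoco_val2014_annotations.json",
    "val/OpenEnded_mscoco_val2014_questions.json",
]

# Field names taken from the inference step.
CSV_FIELDS = ["image", "question", "question_id"]


def missing_inputs(root="."):
    """Return the expected paths that do not exist under root."""
    return [path for path in EXPECTED_INPUTS if not (Path(root) / path).exists()]


def write_questions_csv(csv_path, questions):
    """Write (image, question) pairs to a CSV file.

    Assumption: a header row is expected and question_id may be any
    unique integer; here it is simply the row index.
    """
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=CSV_FIELDS)
        writer.writeheader()
        for question_id, (image, question) in enumerate(questions):
            writer.writerow(
                {"image": image, "question": question, "question_id": question_id}
            )
    return csv_path


if __name__ == "__main__":
    for path in missing_inputs("."):
        print("missing:", path)
    # Hypothetical image name and question, for illustration only.
    write_questions_csv("test/questions.csv", [("example.jpg", "What color is the cat?")])
```

Running the check before training catches a misplaced annotation file early, rather than after the data loader has already started.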

Results

A pretrained model with the default configuration can be downloaded here. This model was trained solely on the VQA v1 training data. It achieves 60.35% accuracy on the VQA v1 validation data. Here are some successful examples: [figure: examples]
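
The README does not say how the accuracy figure is computed. One plausible reading, assumed here, is the VQA accuracy metric in its commonly quoted form: an answer scores min(n/3, 1), where n is the number of human annotators who gave exactly that answer. The sketch below illustrates that formula only. The function names and the sample data are hypothetical, and the official evaluation also normalizes answer strings, which is omitted here.

```python
def vqa_accuracy(predicted, human_answers):
    """Score one prediction as min(number of matching humans / 3, 1)."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


def mean_accuracy(predictions, annotations):
    """Average the per-question scores.

    predictions: dict mapping question id to the predicted answer
    annotations: dict mapping question id to the list of human answers
    """
    scores = [
        vqa_accuracy(predictions[question_id], annotations[question_id])
        for question_id in predictions
    ]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical answers, for illustration only.
    predictions = {1: "dog", 2: "red"}
    annotations = {1: ["dog"] * 10, 2: ["blue"] * 10}
    print(mean_accuracy(predictions, annotations))
```

Under this metric a prediction gets full credit once three annotators agree with it, so the score tolerates disagreement among humans.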

References
