neural-vqa-attention

Torch implementation of an attention-based visual question answering model (Stacked Attention Networks for Image Question Answering, Yang et al., CVPR16).

  1. Train your own network
    1. Preprocess VQA dataset
    2. Extract image features
    3. Training
  2. Use a pretrained model
    1. Pretrained models and data files
    2. Running evaluation
  3. Results

Intuitively, the model looks at an image, reads a question, and comes up with an answer to the question and a heatmap of where it looked in the image to answer it.

The model can also refer back to the image multiple times (Stacked Attention) before producing the answer. This is controlled by the num_attention_layers parameter in the code (default = 1).
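To make the "look, then look again" idea concrete, here is a minimal pure-Python sketch of stacked attention in the spirit of Yang et al. It is not the repository's Torch code: the weight names (W_i, W_q, w_p), the tiny dimensions, and the random weights are assumptions chosen for illustration, and bias terms are omitted for brevity. Each layer scores every image region against the current query, turns the scores into a heatmap with a softmax, and adds the attention-weighted image vector to the query before the next layer.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_layer(regions, query, W_i, W_q, w_p):
    """One attention hop: returns (weights over regions, refined query)."""
    q_proj = matvec(W_q, query)
    scores = []
    for v in regions:
        v_proj = matvec(W_i, v)
        hidden = [math.tanh(a + b) for a, b in zip(v_proj, q_proj)]
        scores.append(sum(w * h for w, h in zip(w_p, hidden)))
    weights = softmax(scores)  # this is the attention "heatmap"
    d = len(query)
    attended = [sum(p * v[j] for p, v in zip(weights, regions)) for j in range(d)]
    refined = [a + q for a, q in zip(attended, query)]
    return weights, refined

def stacked_attention(regions, question, layers):
    """Apply each layer in turn, feeding the refined query forward."""
    query, heatmaps = question, []
    for W_i, W_q, w_p in layers:
        weights, query = attention_layer(regions, query, W_i, W_q, w_p)
        heatmaps.append(weights)
    return heatmaps, query

def random_layer(d, k, rng):
    """Random weights, for illustration only (hypothetical helper)."""
    W_i = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(k)]
    W_q = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(k)]
    w_p = [rng.uniform(-1, 1) for _ in range(k)]
    return W_i, W_q, w_p

rng = random.Random(0)
d, k, num_regions = 4, 3, 6  # toy sizes, not the model's real dimensions
regions = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(num_regions)]
question = [rng.uniform(-1, 1) for _ in range(d)]
num_attention_layers = 2
layers = [random_layer(d, k, rng) for _ in range(num_attention_layers)]
heatmaps, final_query = stacked_attention(regions, question, layers)
```

With num_attention_layers = 1 the loop runs once and the single heatmap is what gets visualized; with 2, the second layer attends using a query that already contains image evidence from the first.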

NOTE: This is NOT a state-of-the-art model. Refer to MCB, MLB or HieCoAtt for that. This is a simple, somewhat interpretable model that gets decent accuracies and produces nice-looking results. The code was written about a year ago as part of VQA-HAT, and I'd meant to release it earlier, but couldn't get around to cleaning things up.

If you just want to run the model on your own images, download links to pretrained models are given below.

Train your own network

Preprocess VQA dataset

Pass --split 1 to train on train and evaluate on val, or --split 2 to train on train+val and evaluate on test.

cd data/
python vqa_preprocessing.py --download True --split 1
cd ..
python prepro.py --input_train_json data/vqa_raw_train.json --input_test_json data/vqa_raw_test.json --num_ans 1000
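As a rough illustration of what a --num_ans cutoff typically means, the sketch below keeps the most frequent answers and drops training examples whose answer falls outside that set. This is an assumption about the general technique, not a description of prepro.py itself; the function names and the "ans" field are hypothetical.

```python
from collections import Counter

def top_answers(answers, num_ans):
    """Return the num_ans most frequent answer strings, most common first."""
    counts = Counter(answers)
    return [ans for ans, _ in counts.most_common(num_ans)]

def filter_examples(examples, vocab):
    """Keep only examples whose answer is in the answer vocabulary."""
    keep = set(vocab)
    return [ex for ex in examples if ex["ans"] in keep]

# Toy data; the "ans" key is a hypothetical field name.
examples = [
    {"question": "Is it sunny?", "ans": "yes"},
    {"question": "Is it raining?", "ans": "no"},
    {"question": "Is there a dog?", "ans": "yes"},
    {"question": "How many cats?", "ans": "2"},
    {"question": "Is it night?", "ans": "no"},
    {"question": "Is that a bus?", "ans": "yes"},
]
vocab = top_answers([ex["ans"] for ex in examples], num_ans=2)
kept = filter_examples(examples, vocab)
```

Treating answering as classification over a fixed answer set is what makes the output layer a simple softmax over num_ans classes.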

Extract image features

Since we don't finetune the CNN, training is significantly faster if image features are pre-extracted. We use image features from VGG-19. The model can be downloaded and features extracted using:

sh scripts/download_vgg19.sh
th prepro_img.lua -image_root /path/to/coco/images/ -gpuid 0

Training

th train.lua

Use a pretrained model

Pretrained models and data files

All files are available for download here.

  • san1_2.t7: model pretrained on train+val with 1 attention layer (SAN-1)
  • san2_2.t7: model pretrained on train+val with 2 attention layers (SAN-2)
  • params_1.json: vocabulary file for training on train, evaluating on val
  • params_2.json: vocabulary file for training on train+val, evaluating on test
  • qa_1.h5: QA features for training on train, evaluating on val
  • qa_2.h5: QA features for training on train+val, evaluating on test
  • img_train_1.h5 & img_test_1.h5: image features for training on train, evaluating on val
  • img_train_2.h5 & img_test_2.h5: image features for training on train+val, evaluating on test
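The file names above follow a simple pattern: the trailing number is the split, and the number after "san" is the count of attention layers. A small helper like the following (hypothetical, not part of the repository) can make that mapping explicit when wiring up paths:

```python
def data_files_for_split(split):
    """Map a split (1 or 2) to the data file names listed above."""
    if split not in (1, 2):
        raise ValueError("split must be 1 (train/val) or 2 (train+val/test)")
    return {
        "params_json": f"params_{split}.json",
        "qa_h5": f"qa_{split}.h5",
        "img_train_h5": f"img_train_{split}.h5",
        "img_test_h5": f"img_test_{split}.h5",
    }

def pretrained_model_file(num_attention_layers):
    """Pretrained checkpoints are listed only for split 2 (train+val)."""
    if num_attention_layers not in (1, 2):
        raise ValueError("only SAN-1 and SAN-2 checkpoints are listed")
    return f"san{num_attention_layers}_2.t7"

files = data_files_for_split(2)
model = pretrained_model_file(1)
```

Because both pretrained checkpoints were trained on train+val, they pair with the split 2 files.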

Running evaluation

model_path=checkpoints/model.t7 qa_h5=data/qa.h5 params_json=data/params.json img_test_h5=data/img_test.h5 th eval.lua

This will generate a JSON file containing question ids and predicted answers. To compute accuracy on val, use the VQA Evaluation Tools. For test, submit to the VQA evaluation server on EvalAI.
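For a quick sanity check before running the official tools, the standard VQA accuracy rule can be sketched in a few lines: a predicted answer scores min(number of humans who gave that answer / 3, 1). The field names "question_id" and "answer" and the in-memory ground-truth dictionary are assumptions for illustration. The official tools also normalize answer strings and average over annotator subsets, so their numbers may differ slightly from this simplified version.

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy for one question: min(#matches / 3, 1)."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

def mean_accuracy(results, ground_truth):
    """Average accuracy (in percent) over a list of predictions.

    results:      list of {"question_id": ..., "answer": ...} (assumed format)
    ground_truth: dict mapping question id to the list of human answers
    """
    scores = [
        vqa_accuracy(r["answer"], ground_truth[r["question_id"]])
        for r in results
    ]
    return 100.0 * sum(scores) / len(scores)

# Toy data standing in for the generated JSON file and the annotations.
results = [
    {"question_id": 1, "answer": "yes"},
    {"question_id": 2, "answer": "red"},
]
ground_truth = {
    1: ["yes"] * 10,
    2: ["red"] + ["blue"] * 9,
}
accuracy = mean_accuracy(results, ground_truth)
```

In practice the results list would come from json.load on the file written by eval.lua, and the ground truth from the VQA annotation files.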

Results

Format: sets of 3 columns. Column 1 shows the original image, column 2 shows the 'attention' heatmap of where the model looks, and column 3 shows the image overlaid with attention. The input question and the answer predicted by the model are shown below each example.
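The third column can be approximated as follows: upsample the coarse attention grid to the image size, rescale it to [0, 1], and blend it with the image. This is a minimal sketch on grayscale nested lists, assuming nearest-neighbor upsampling and a fixed blend factor alpha; the repository's actual visualization may use different interpolation, smoothing, or a color map.

```python
def upsample_nearest(grid, out_h, out_w):
    """Nearest-neighbor upsampling of a 2D grid to out_h x out_w."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]

def normalize(grid):
    """Rescale values to [0, 1]; a constant grid maps to all zeros."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in grid]

def overlay(image, heatmap, alpha=0.5):
    """Blend a grayscale image (values in [0, 1]) with a same-size heatmap."""
    return [
        [(1 - alpha) * p + alpha * h for p, h in zip(img_row, heat_row)]
        for img_row, heat_row in zip(image, heatmap)
    ]

# Toy example: a 2x2 attention grid over a 4x4 uniform gray image.
attention = [[0.1, 0.2], [0.3, 0.4]]
image = [[0.5] * 4 for _ in range(4)]
heat = normalize(upsample_nearest(attention, 4, 4))
blended = overlay(image, heat, alpha=0.5)
```

Regions with high attention end up brighter in the blended image, which is what makes the overlay readable at a glance.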

More results available here.

Quantitative Results

Models are trained on train for val accuracies, and on train+val for test accuracies.

VQA v2.0

Method          val     test
SAN-1           53.15   55.28
SAN-2           52.82   -
d-LSTM + n-I    51.62   54.22
HieCoAtt        54.57   -
MCB             59.14   -
VQA v1.0

Method          test-std
SAN-1           59.87
SAN-2           59.59
d-LSTM + n-I    58.16
HieCoAtt        62.10
MCB             65.40

License

MIT
