
Official Code for NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks
arXiv | video

[NEW] Gradio web demo for VQA-X on Hugging Face Spaces
[NEW] Gradio web demo for ACT-X on Hugging Face Spaces



Requirements

  • PyTorch 1.8 or higher
  • CLIP (install with pip install git+https://github.com/openai/CLIP.git)
  • transformers (install with pip install transformers)
  • accelerate for distributed training (install with pip install git+https://github.com/huggingface/accelerate)
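
As a quick sanity check (a minimal sketch; the printed values depend on your installation), you can confirm that all dependencies import correctly:

import torch, clip, transformers, accelerate

print(torch.__version__)        # should be 1.8 or higher
print(transformers.__version__)
print(accelerate.__version__)
print(clip.available_models())  # e.g. ['RN50', ..., 'ViT-B/16', 'ViT-B/32']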

Images Download

We conduct experiments on four different V/VL NLE datasets: VQA-X, ACT-X, e-SNLI-VE and VCR. Please download the images into a folder named images in your directory using the following links. Note that our code does not use pre-cached visual features; instead, the features are extracted directly during code execution (see the sketch after this list):

  • VQA-X: COCO train2014 and val2014 images
  • ACT-X: MPI images. Rename to mpi
  • e-SNLI-VE: Flickr30K images. Rename to flickr30k
  • VCR: VCR images. Rename to vcr
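
For intuition, on-the-fly feature extraction with CLIP looks roughly like the sketch below. This is illustrative only, not the repo's exact code; the encoder name ("ViT-B/16") and the image path are placeholder assumptions.

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)  # encoder choice is an assumption

# placeholder path; any downloaded image works
image = preprocess(Image.open("images/train2014/COCO_train2014_000000000009.jpg"))
with torch.no_grad():
    features = model.encode_image(image.unsqueeze(0).to(device))
print(features.shape)  # pooled CLIP image embedding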

Annotations Download

We structure the annotations for the NLE datasets. You can download the structured annotations from here: VQA-X, ACT-X, e-SNLI-VE, VCR. Place them in the nle_data/dataset_name/ directory, where dataset_name is one of {VQA-X, ACT-X, eSNLI-VE, VCR}. The pretraining annotations are here. Please also see this issue for clarification on which pretraining annotations to use. If you want to preprocess the annotations yourself rather than downloading them directly, the code can be found in utils/nle_preprocess.ipynb.
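
The structured annotations are plain JSON, so you can inspect them directly. A minimal sketch (the filename below is a guess; use whatever the downloaded file is actually called):

import json

# hypothetical filename; check the downloaded VQA-X files for the actual name
with open("nle_data/VQA-X/vqaX_train.json") as f:
    anns = json.load(f)
print(len(anns))  # number of annotated samples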

You also need cococaption and the annotations in the correct format in order to evaluate the NLG metrics. We use the python3 cococaption toolkit here. Please download it and place the cococaption folder in your directory. The annotations in the correct format can be downloaded here; please place them in the annotations folder. If you want to manually convert the natural language explanation data from the source into the format that cococaption expects for evaluation, rather than downloading it directly, the code can be found in utils/preprocess_for_cococaption_eval.ipynb.
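
Evaluation then follows the standard COCO-captioning pattern (a sketch assuming the usual cococaption entry points; both file names are placeholders, and the import paths may differ slightly depending on the toolkit layout):

from pycocotools.coco import COCO           # bundled with the cococaption toolkit
from pycocoevalcap.eval import COCOEvalCap

coco = COCO("annotations/vqaX_test_annot_exp.json")      # placeholder annotation file
coco_res = coco.loadRes("results/vqaX_exp_preds.json")   # placeholder results file
coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.evaluate()
for metric, value in coco_eval.eval.items():
    print(metric, value)  # BLEU, METEOR, ROUGE-L, CIDEr, SPICE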

You will also need BERTScore if you evaluate with it. You can install it with pip install bert_score==0.3.7.
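
A minimal usage example (the candidate and reference sentences are placeholders):

from bert_score import score

cands = ["a man is riding a skateboard"]         # generated text
refs = ["a person is skateboarding on a ramp"]   # reference text
P, R, F1 = score(cands, refs, lang="en")
print(F1.mean().item())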

Code

One GPU is enough for finetuning on NLE. However, if you wish to do distributed training, please set it up first using accelerate (you can still use accelerate even with a single GPU). In your environment command line, type:

accelerate config

and answer the questions.
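
Under the hood, the training scripts follow the standard accelerate pattern. A self-contained toy sketch of that pattern (the linear model and random data below are stand-ins, not the repo's actual model or dataloader):

import torch
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(10, 2)  # stand-in for the NLX-GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = torch.utils.data.TensorDataset(torch.randn(32, 10), torch.randint(0, 2, (32,)))
loader = torch.utils.data.DataLoader(dataset, batch_size=8)

# accelerate wraps the model, optimizer and dataloader for single- or multi-GPU runs
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
loss_fn = torch.nn.CrossEntropyLoss()

for x, y in loader:
    loss = loss_fn(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward(); handles gradient sync
    optimizer.step()
    optimizer.zero_grad()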

VQA-X

Please run from the command line with:

accelerate launch vqaX.py

Note: To finetune from the pretrained captioning model, please set the finetune_pretrained flag to True.

ACT-X

Please run from the command line with:

accelerate launch actX.py

Note: To finetune from the pretrained captioning model, please set the finetune_pretrained flag to True.

e-SNLI-VE

Please run from the command line with:

accelerate launch esnlive.py

e-SNLI-VE (+ Concepts)

Please run from the command line with:

accelerate launch esnlive_concepts.py

VCR

Please run from the command line with:

accelerate launch vcr.py

This will give you the unfiltered scores. After that, we use BERTScore to filter out the incorrectly answered samples and obtain the filtered scores (see the paper appendix for more details). Since BERTScore takes time to compute, it is not practical to run it and filter the scores after every epoch; therefore, we perform this operation once, on the epoch with the best unfiltered scores. Please run:

python vcr_filter.py
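
Conceptually, the filtering step scores each predicted answer against the ground-truth answer with BERTScore and keeps only the explanations whose answer passes a similarity threshold. A simplified sketch (the threshold value and the toy data are assumptions, not the script's exact logic):

from bert_score import score

preds = ["pick up the cup", "run away"]        # predicted answers (placeholders)
truths = ["grab the mug", "open the door"]     # ground-truth answers (placeholders)
exps = ["because he is thirsty", "because the door is locked"]

_, _, F1 = score(preds, truths, lang="en")
THRESHOLD = 0.9  # assumed value for illustration
filtered = [e for e, f in zip(exps, F1.tolist()) if f >= THRESHOLD]
print(filtered)  # explanations that count toward the filtered scores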

Models

All models can be downloaded from the links below:

  • Pretrained Model on Image Captioning: link
  • VQA-X (w/o pretraining): link
  • VQA-X (w/ pretraining): link
  • ACT-X (w/o pretraining): link
  • ACT-X (w/ pretraining): link
  • Concept Head + Wordmap (used in e-SNLI-VE w/ concepts): link
  • e-SNLI-VE (w/o concepts): link
  • e-SNLI-VE (w/ concepts): link
  • VCR: link

Note: Place the concept model and its wordmap in a folder named pretrained_model/

Results

The output results (generated text) on the test dataset can be downloaded from the links below. The file-name suffixes mean the following:

  • _filtered: contains only the explanations for which the predicted answer is correct.
  • _unfiltered: all explanations are included, regardless of whether the predicted answer is correct.
  • _full: the full output prediction (including the answer + explanation).
  • _exp: the explanation part only. All evaluation is performed on _exp.

See section 4 of the paper for more details.

  • VQA-X (w/o pretraining): link
  • VQA-X (w/ pretraining): link
  • ACT-X (w/o pretraining): link
  • ACT-X (w/ pretraining): link
  • e-SNLI-VE (w/o concepts): link
  • e-SNLI-VE (w/ concepts): link
  • VCR: link

Please note that, in the case of VCR, the results shown on page 4 of the appendix may not correspond exactly to the results and pretrained model in the links above: we trained several models and randomly picked one for presenting the qualitative results.

Proposed Evaluation Metrics

Please see the explain_predict and retrieval_attack folders.
