CLIP-grounding

Quantitative evaluation of CLIP's cross-modal grounding abilities via an attention-based explainability method.

Sample qualitative result

Abstract

Powerful multimodal models such as CLIP combine vision and language to reliably align image-text pairs. However, it is unclear whether CLIP focuses on the right signals when aligning images and text. To answer this, we leverage a state-of-the-art attention-explainability method called Transformer-MM-Explainability and quantify how well CLIP grounds linguistic concepts in images and visual concepts in text.

Towards this, we use the Panoptic Narrative Grounding benchmark proposed by Gonzalez et al., which provides fine-grained segmentation masks corresponding to parts of sentences.
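At a high level, the Transformer-MM-Explainability rule scores each attention map by the positive part of its elementwise product with its gradient, averages over heads, and accumulates the result into a relevance map layer by layer. Below is a simplified sketch of the self-attention part of that rule, written for this explanation only (the full bi-modal method also propagates cross-attention relevance maps; see the original paper and codebase for the exact procedure, and note that all names here are illustrative):

    import torch

    def self_attention_relevance(attentions, gradients):
        # attentions[i], gradients[i]: (num_heads, tokens, tokens) for layer i,
        # i.e. the attention maps of one encoder and their gradients w.r.t.
        # the image-text matching score.
        num_tokens = attentions[0].shape[-1]
        R = torch.eye(num_tokens)  # each token starts fully self-relevant
        for A, dA in zip(attentions, gradients):
            # Keep only positively contributing attention, averaged over heads.
            A_bar = (dA * A).clamp(min=0).mean(dim=0)
            # Propagate relevance through the layer (residual + attention).
            R = R + A_bar @ R
        return R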

Setup

Follow the steps provided here to create a conda environment and activate it.

Dataset

Download the MSCOCO dataset (only the validation images are required for this work) and its panoptic segmentation annotations by running:

bash setup/download_mscoco.sh

This should produce the following folder structure:

data/panoptic_narrative_grounding
├── __MACOSX
│   └── panoptic_val2017
├── annotations
│   ├── panoptic_segmentation
│   ├── panoptic_train2017.json
│   ├── panoptic_val2017.json
│   └── png_coco_val2017.json
└── images
    └── val2017

6 directories, 3 files

⌛ This step takes about 30 minutes (depending on your Internet connection).
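To sanity-check the download, you can peek at the narrative annotations directly. A minimal sketch (it only prints a sample so you can inspect the schema; we make no assumptions about the record structure here):

    import json

    ann_path = "data/panoptic_narrative_grounding/annotations/png_coco_val2017.json"
    with open(ann_path) as f:
        png_annotations = json.load(f)

    # Print a small sample to discover the record structure before
    # writing any custom loaders.
    if isinstance(png_annotations, list):
        print(len(png_annotations), "records; first record:")
        print(png_annotations[0])
    else:
        print("top-level keys:", list(png_annotations)[:10])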

Demo

In order to run our code on samples from the PNG benchmark dataset, please run this notebook. It assumes that you have the conda environment set up as before and the dataset downloaded.

🤗 Alternatively, check out our Hugging Face Spaces demo here.
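For a quick standalone sanity check of CLIP's image-text matching outside the notebook, the sketch below uses the Hugging Face transformers CLIP API (an assumption on our part; the repo may load CLIP differently, and the image filename and captions are only examples):

    from PIL import Image
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Any image from val2017 works here; this filename is just an example.
    image = Image.open("data/panoptic_narrative_grounding/images/val2017/000000000139.jpg")
    texts = ["a living room with a television", "a baseball game"]

    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    print(outputs.logits_per_image.softmax(dim=-1))  # image-text matching probabilities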

Quantitative evaluation

To reproduce our results for the CLIP model on the Panoptic Narrative Grounding (PNG) benchmark, use the following procedure:

  • Activate the conda environment and set PYTHONPATH. Make sure you are at the repo root.
    conda activate clip-grounding
    export PYTHONPATH=$PWD
  • Run the evaluation script:

CLIP (multi-modal): To run the evaluation with CLIP using both modalities, run

python clip_grounding/evaluation/clip_on_png.py --eval_method clip

This will save metrics to the outputs/ folder. The resulting numbers are presented in the table below.

CLIP (unimodal): To run a stronger baseline that uses only one modality of CLIP, run

python clip_grounding/evaluation/clip_on_png.py --eval_method clip-unimodal

Random baseline: To run the baseline evaluation (with random attributions), run

python clip_grounding/evaluation/clip_on_png.py --eval_method random

The cross-modal grounding results for different variants are summarized in the following table:

                        Random    CLIP-Unimodal    CLIP
Text-to-Image (IoU)     0.2763    0.4310           0.4917
Image-to-Text (IoU)     0.2557    0.4570           0.5099
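The numbers above are IoU scores between a binarized relevance map and the ground-truth mask (a segmentation mask for text-to-image, a token mask for image-to-text). A minimal sketch of such a computation (the function name and the 0.5 threshold are illustrative; clip_grounding/evaluation/clip_on_png.py implements the exact protocol):

    import numpy as np

    def grounding_iou(relevance, gt_mask, threshold=0.5):
        # relevance: 2-D map normalized to [0, 1], resized to the mask's
        # resolution; gt_mask: binary ground-truth mask.
        pred = relevance >= threshold
        gt = gt_mask.astype(bool)
        union = np.logical_or(pred, gt).sum()
        if union == 0:
            return 0.0
        return float(np.logical_and(pred, gt).sum()) / union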

Acknowledgements

We'd like to thank the TAs, in particular, Jaap Jumelet and Tom Kersten, for useful initial discussions, and the course instructor Prof. Jelle Zuidema.

We greatly appreciate the open-sourced code/datasets/models from the following resources:

  • Transformer-MM-Explainability (Chefer et al.)
  • CLIP (OpenAI)
  • Panoptic Narrative Grounding benchmark (Gonzalez et al.)
  • MSCOCO dataset
