
Context-Aware Multi-View Summarization Network for Image-Text Matching (CAMERA)

PyTorch code of the paper "Context-Aware Multi-View Summarization Network for Image-Text Matching". It is built on top of VSRN and SAEM.

Leigang Qu, Meng Liu, Da Cao, Liqiang Nie, and Qi Tian. "Context-Aware Multi-View Summarization Network for Image-Text Matching", ACM MM, 2020. [pdf]

Introduction

Image-text matching is a vital yet challenging task in the field of multimedia analysis. Over the past decades, great efforts have been made to bridge the semantic gap between the visual and textual modalities. Despite their significance and value, most prior works are still confronted with the multi-view description challenge, i.e., how to align an image to multiple textual descriptions with semantic diversity. Toward this end, we present a novel context-aware multi-view summarization network that summarizes context-enhanced visual region information from multiple views. To be more specific, we design an adaptive gating self-attention module to extract representations of visual regions and words. By controlling the internal information flow, the module adaptively captures context information. Afterwards, we introduce a summarization module with a diversity regularization to aggregate region-level features into image-level ones from different perspectives. Ultimately, we devise a multi-view matching scheme to match multi-view image features with the corresponding textual ones. To justify our work, we conducted extensive experiments on two benchmark datasets, i.e., Flickr30K and MS-COCO, which demonstrate the superiority of our model compared to several state-of-the-art baselines.
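To make the two core ideas concrete, below is a minimal PyTorch sketch of a gated self-attention layer and a ||AA^T - I||_F^2-style diversity penalty. This is an illustrative sketch only: the class and function names, dimensions, single-head formulation, and the exact gating form are assumptions for exposition, not the authors' implementation.

import torch
import torch.nn as nn

class AdaptiveGatingSelfAttention(nn.Module):
    # Sketch: self-attention whose output is gated against the input, so each
    # region/word adaptively controls how much context flows into it.
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x):                      # x: (batch, n_items, dim)
        q, k, v = self.query(x), self.key(x), self.value(x)
        attn = torch.softmax(q @ k.transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        context = attn @ v                     # context-enhanced features
        g = torch.sigmoid(self.gate(torch.cat([x, context], dim=-1)))
        return g * context + (1 - g) * x       # gated information flow

def diversity_regularization(A):
    # A: (batch, n_views, n_regions) summarization weights. Penalizing
    # ||A A^T - I||_F^2 pushes the views to attend to different regions.
    I = torch.eye(A.size(1), device=A.device)
    return ((A @ A.transpose(1, 2) - I) ** 2).sum()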

[Figure: overview of the CAMERA model architecture]

Requirements

We recommend the following dependencies.

import nltk
nltk.download('punkt')  # Punkt tokenizer models, used for caption tokenization

Download data

Download the dataset files and pre-trained models. We use the splits produced by Andrej Karpathy. The raw images can be downloaded from their original sources here, here and here.

We follow the bottom-up attention model and SCAN to obtain image features for a fair comparison. More details about data pre-processing (optional) can be found here. All the precomputed image-feature data needed to reproduce the experiments in the paper can be downloaded from SCAN by running:

wget https://scanproject.blob.core.windows.net/scan-data/data.zip

You can also get the data from Google Drive: https://drive.google.com/drive/u/1/folders/1os1Kr7HeTbh8FajBNegW8rjJf6GIhFqC.

In addition, we use bottom-up attention to extract the positions of the detected boxes, including coordinates, width, and height, which can be downloaded from https://drive.google.com/file/d/1K9LnWJc71dK6lF1BJMPlbkIu_vYmHjVP/view?usp=sharing.

We refer to the path of extracted files as $DATA_PATH.
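For orientation, the precomputed data follow the SCAN-style layout; below is a hedged loading sketch, with filenames and shapes taken from the SCAN release (they may differ slightly here):

import numpy as np

data_path = "data/coco_precomp"                  # i.e., $DATA_PATH/coco_precomp
images = np.load(data_path + "/train_ims.npy")   # roughly (n_images, 36, 2048):
                                                 # 36 bottom-up regions per image
with open(data_path + "/train_caps.txt") as f:
    captions = [line.strip() for line in f]      # 5 captions per image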

BERT model

We use the BERT code from BERT-pytorch. Please follow the instructions here to convert the Google BERT model to a PyTorch save file $BERT_PATH.
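A hedged example of that conversion, assuming the pytorch-pretrained-bert package is the one in use (all paths below are illustrative):

from pytorch_pretrained_bert.convert_tf_checkpoint_to_pytorch import \
    convert_tf_checkpoint_to_pytorch

convert_tf_checkpoint_to_pytorch(
    "uncased_L-12_H-768_A-12/bert_model.ckpt",   # Google TensorFlow checkpoint
    "uncased_L-12_H-768_A-12/bert_config.json",  # matching BERT config
    "bert_base_uncased/pytorch_model.bin")       # PyTorch save file ($BERT_PATH)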

Training new models

Run train.py:

For MSCOCO:

python train.py --data_path $DATA_PATH --bert_path $BERT_PATH --data_name coco_precomp --logger_name runs/coco --max_violation --num_epochs 40 --lr_update 20

For Flickr30K:

python train.py --data_path $DATA_PATH --bert_path $BERT_PATH --data_name f30k_precomp --logger_name runs/flickr --max_violation --num_epochs 30 --lr_update 10
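In both commands, --max_violation selects the hardest-negative variant of the hinge-based triplet ranking loss (as in VSE++), and --lr_update gives the number of epochs after which the learning rate is decayed. A minimal sketch of the max-violation objective (the margin value and masking details are illustrative assumptions):

import torch

def contrastive_loss(scores, margin=0.2):
    # scores: (n, n) similarity matrix with positives on the diagonal,
    # scores[i, j] = sim(image_i, caption_j).
    n = scores.size(0)
    diag = scores.diag().view(n, 1)
    cost_cap = (margin + scores - diag).clamp(min=0)      # negatives per image
    cost_img = (margin + scores - diag.t()).clamp(min=0)  # negatives per caption
    mask = torch.eye(n, dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    # --max_violation: keep only the hardest negative for each positive pair.
    return cost_cap.max(1)[0].sum() + cost_img.max(0)[0].sum()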

Evaluate trained models

Modify model_path and data_path in the evaluate_models.py file, then run evaluate_models.py:

python evaluate_models.py

To perform five-fold cross-validation on the MSCOCO 1K test set, pass fold5=True; pass fold5=False to evaluate on the full MSCOCO 5K test set. Pretrained models can be downloaded from https://drive.google.com/drive/folders/16O9cqYDnQdLKHyiOUTUexih_yCfzTelW?usp=sharing.
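Since CAMERA builds on VSRN, the evaluation entry point is likely an evalrank-style helper; the calls below are a sketch under that assumption, not the confirmed API:

from evaluation import evalrank

evalrank("runs/coco/model_best.pth.tar", data_path="$DATA_PATH",
         split="testall", fold5=True)   # MSCOCO 1K: average over 5 folds
evalrank("runs/coco/model_best.pth.tar", data_path="$DATA_PATH",
         split="testall", fold5=False)  # MSCOCO 5K: full test set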

Reference

@inproceedings{qu2020camera,
	title={Context-Aware Multi-View Summarization Network for Image-Text Matching},
	author={Qu, Leigang and Liu, Meng and Cao, Da and Nie, Liqiang and Tian, Qi},
	booktitle={Proceedings of the 28th ACM International Conference on Multimedia},
	pages={1-9},
	year={2020}
}
