
Context-Aware Multi-View Summarization Network for Image-Text Matching (CAMERA)

PyTorch code of the paper "Context-Aware Multi-View Summarization Network for Image-Text Matching". It is built on top of VSRN and SAEM.

Leigang Qu, Meng Liu, Da Cao, Liqiang Nie, and Qi Tian. "Context-Aware Multi-View Summarization Network for Image-Text Matching", ACM MM, 2020. [pdf]

Introduction

Image-text matching is a vital yet challenging task in the field of multimedia analysis. Over the past decades, great efforts have been made to bridge the semantic gap between the visual and textual modalities. Despite their significance and value, most prior works are still confronted with the multi-view description challenge, i.e., how to align an image to multiple textual descriptions with semantic diversity. Toward this end, we present a novel context-aware multi-view summarization network that summarizes context-enhanced visual region information from multiple views. To be more specific, we design an adaptive gating self-attention module to extract representations of visual regions and words. By controlling the internal information flow, the module adaptively captures context information. Afterwards, we introduce a summarization module with a diversity regularization to aggregate region-level features into image-level ones from different perspectives. Ultimately, we devise a multi-view matching scheme to match multi-view image features with the corresponding textual ones. To justify our work, we conducted extensive experiments on two benchmark datasets, i.e., Flickr30K and MS-COCO, which demonstrate the superiority of our model compared to several state-of-the-art baselines.
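To make the two core ideas concrete, below is a minimal PyTorch sketch of a gated self-attention layer and a ||AA^T - I||_F^2-style diversity penalty. This is an illustrative sketch only: the class and function names, dimensions, single-head formulation, and the exact gating form are assumptions for exposition, not the authors' implementation.

import torch
import torch.nn as nn

class AdaptiveGatingSelfAttention(nn.Module):
    # Sketch: self-attention whose output is gated against the input, so each
    # region/word adaptively controls how much context flows into it.
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x):                      # x: (batch, n_items, dim)
        q, k, v = self.query(x), self.key(x), self.value(x)
        attn = torch.softmax(q @ k.transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        context = attn @ v                     # context-enhanced features
        g = torch.sigmoid(self.gate(torch.cat([x, context], dim=-1)))
        return g * context + (1 - g) * x       # gated information flow

def diversity_regularization(A):
    # A: (batch, n_views, n_regions) summarization weights. Penalizing
    # ||A A^T - I||_F^2 pushes the views to attend to different regions.
    I = torch.eye(A.size(1), device=A.device)
    return ((A @ A.transpose(1, 2) - I) ** 2).sum()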

[Figure: overview of the CAMERA model architecture]

Requirements

We recommend the following dependencies.

import nltk
nltk.download('punkt')  # Punkt tokenizer models, used for caption tokenization

Download data

Download the dataset files and pre-trained models. We use the splits produced by Andrej Karpathy. The raw images can be downloaded from their original sources here, here and here.

We follow the bottom-up attention model and SCAN to obtain image features for a fair comparison. More details about data pre-processing (optional) can be found here. All the precomputed image-feature data needed to reproduce the experiments in the paper can be downloaded from SCAN by running:

wget https://scanproject.blob.core.windows.net/scan-data/data.zip

You can also get the data from Google Drive: https://drive.google.com/drive/u/1/folders/1os1Kr7HeTbh8FajBNegW8rjJf6GIhFqC.

In addition, we use bottom-up attention to extract the positions of the detected boxes, including coordinates, width, and height, which can be downloaded from https://drive.google.com/file/d/1K9LnWJc71dK6lF1BJMPlbkIu_vYmHjVP/view?usp=sharing.

We refer to the path of extracted files as $DATA_PATH.
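For orientation, the precomputed data follow the SCAN-style layout; below is a hedged loading sketch, with filenames and shapes taken from the SCAN release (they may differ slightly here):

import numpy as np

data_path = "data/coco_precomp"                  # i.e., $DATA_PATH/coco_precomp
images = np.load(data_path + "/train_ims.npy")   # roughly (n_images, 36, 2048):
                                                 # 36 bottom-up regions per image
with open(data_path + "/train_caps.txt") as f:
    captions = [line.strip() for line in f]      # 5 captions per image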

BERT model

We use the BERT code from BERT-pytorch. Please follow the instructions here to convert the Google BERT model to a PyTorch save file $BERT_PATH.
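A hedged example of that conversion, assuming the pytorch-pretrained-bert package is the one in use (all paths below are illustrative):

from pytorch_pretrained_bert.convert_tf_checkpoint_to_pytorch import \
    convert_tf_checkpoint_to_pytorch

convert_tf_checkpoint_to_pytorch(
    "uncased_L-12_H-768_A-12/bert_model.ckpt",   # Google TensorFlow checkpoint
    "uncased_L-12_H-768_A-12/bert_config.json",  # matching BERT config
    "bert_base_uncased/pytorch_model.bin")       # PyTorch save file ($BERT_PATH)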

Training new models

Run train.py:

For MSCOCO:

python train.py --data_path $DATA_PATH --bert_path $BERT_PATH --data_name coco_precomp --logger_name runs/coco --max_violation --num_epochs 40 --lr_update 20

For Flickr30K:

python train.py --data_path $DATA_PATH --bert_path $BERT_PATH --data_name f30k_precomp --logger_name runs/flickr --max_violation --num_epochs 30 --lr_update 10
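In both commands, --max_violation selects the hardest-negative variant of the hinge-based triplet ranking loss (as in VSE++), and --lr_update gives the number of epochs after which the learning rate is decayed. A minimal sketch of the max-violation objective (the margin value and masking details are illustrative assumptions):

import torch

def contrastive_loss(scores, margin=0.2):
    # scores: (n, n) similarity matrix with positives on the diagonal,
    # scores[i, j] = sim(image_i, caption_j).
    n = scores.size(0)
    diag = scores.diag().view(n, 1)
    cost_cap = (margin + scores - diag).clamp(min=0)      # negatives per image
    cost_img = (margin + scores - diag.t()).clamp(min=0)  # negatives per caption
    mask = torch.eye(n, dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    # --max_violation: keep only the hardest negative for each positive pair.
    return cost_cap.max(1)[0].sum() + cost_img.max(0)[0].sum()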

Evaluate trained models

Modify model_path and data_path in the evaluate_models.py file, then run evaluate_models.py:

python evaluate_models.py

To perform five-fold cross-validation on the MSCOCO 1K test set, pass fold5=True; pass fold5=False to evaluate on the full MSCOCO 5K test set. Pretrained models can be downloaded from https://drive.google.com/drive/folders/16O9cqYDnQdLKHyiOUTUexih_yCfzTelW?usp=sharing.
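Since CAMERA builds on VSRN, the evaluation entry point is likely an evalrank-style helper; the calls below are a sketch under that assumption, not the confirmed API:

from evaluation import evalrank

evalrank("runs/coco/model_best.pth.tar", data_path="$DATA_PATH",
         split="testall", fold5=True)   # MSCOCO 1K: average over 5 folds
evalrank("runs/coco/model_best.pth.tar", data_path="$DATA_PATH",
         split="testall", fold5=False)  # MSCOCO 5K: full test set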

Reference

@inproceedings{qu2020camera,
	title={Context-Aware Multi-View Summarization Network for Image-Text Matching},
	author={Qu, Leigang and Liu, Meng and Cao, Da and Nie, Liqiang and Tian, Qi},
	booktitle={Proceedings of the 28th ACM International Conference on Multimedia},
	pages={1-9},
	year={2020}
}
