
Learning to Discretely Compose Reasoning Module Networks for Video Captioning (IJCAI2020)

Introduction

In this paper, we propose a novel visual reasoning approach for video captioning, named Reasoning Module Networks (RMN), that equips the existing encoder-decoder framework with reasoning capacity. Specifically, RMN employs 1) three sophisticated spatio-temporal reasoning modules, and 2) a dynamic and discrete module selector trained by a linguistic loss with a Gumbel approximation. This repository contains the PyTorch implementation of our work.
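
The idea of a discrete selector trained with a Gumbel approximation can be illustrated with a small, plain-Python sketch. This is not the repository's PyTorch code: the function name `gumbel_softmax_select`, the temperature default, and the input scores are hypothetical, and the sketch only shows the general Gumbel-softmax recipe (add Gumbel noise to the scores, apply a temperature-scaled softmax, then take the arg max as the discrete choice).

```python
import math
import random


def gumbel_softmax_select(scores, temperature=1.0, rng=random):
    """Return (soft_weights, hard_index) for one selection step.

    scores: unnormalized selector scores, one per module (hypothetical input).
    """
    noisy = []
    for s in scores:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # U is clamped away from 0 and 1 to keep both logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((s - math.log(-math.log(u))) / temperature)

    # Numerically stable softmax over the perturbed scores.
    top = max(noisy)
    exps = [math.exp(x - top) for x in noisy]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Discrete decision: index of the largest soft weight.
    hard_index = max(range(len(weights)), key=weights.__getitem__)
    return weights, hard_index
```

In Gumbel-softmax schemes in general, the soft weights keep the selection differentiable during training, while the hard index gives a single discrete choice; lower temperatures push the soft weights closer to one-hot.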

Dependencies

  • Python 3.7 (other versions may also work)
  • PyTorch 1.1.0 (other versions may also work)
  • pickle
  • tqdm
  • h5py
  • matplotlib
  • numpy
  • tensorboard_logger

Prepare

  1. Create two empty folders, data and results.
  2. Download the visual and text features of MSVD and MSR-VTT, and put them in the data folder.
  3. Download the pre-trained models msvd_model and msr-vtt_model, and put them in the results folder.

Download instructions (#1): 1) open the folder, 2) select all files, 3) download.
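
The folder setup from step 1 can also be scripted. A minimal sketch, assuming it is run from the repository root; the helper name `prepare_folders` is hypothetical and not part of the repository.

```python
from pathlib import Path


def prepare_folders(root="."):
    """Create the data and results folders under root if they do not exist."""
    created = []
    for name in ("data", "results"):
        folder = Path(root) / name
        # exist_ok=True makes the call safe to repeat.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created

# Usage from the repository root: prepare_folders()
```

The downloaded features and pre-trained models still have to be placed in these folders by hand.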

Evaluation

We provide the pre-trained "RMN(H+L)" models from the paper to reproduce the reported results. Note that because the MSVD dataset is small, training on it is unstable, so the MSVD result reported in the paper is the average of three training runs.

Metrics   MSVD   MSR-VTT
BLEU@4    56.4   42.5
METEOR    37.2   28.4
ROUGE-L   74.0   61.6
CIDEr     97.8   49.6

Evaluation command example:

python evaluate.py --dataset=msr-vtt --model=RMN \
 --result_dir=results/msr-vtt_model \
 --use_loc --use_rel --use_func \
 --hidden_size=1300 --att_size=1024 \
 --test_batch_size=2 --beam_size=2 \
 --eval_metric=CIDEr
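
To launch the evaluation from a Python script, the same command can be assembled as an argument list. This is a sketch under assumptions: `build_command` is a hypothetical helper, not part of the repository, and it only reuses the flags and values shown in the command above.

```python
import sys


def build_command(script, options, switches=()):
    """Build an argv list: interpreter, script, --key=value options, --switches."""
    argv = [sys.executable, script]
    argv += [f"--{key}={value}" for key, value in options.items()]
    argv += [f"--{name}" for name in switches]
    return argv


eval_cmd = build_command(
    "evaluate.py",
    {
        "dataset": "msr-vtt",
        "model": "RMN",
        "result_dir": "results/msr-vtt_model",
        "hidden_size": 1300,
        "att_size": 1024,
        "test_batch_size": 2,
        "beam_size": 2,
        "eval_metric": "CIDEr",
    },
    switches=("use_loc", "use_rel", "use_func"),
)

# To run it: import subprocess; subprocess.run(eval_cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues and makes it easy to loop over several result directories.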

Training

You can also train your own model. Training command example:

python train.py --dataset=msr-vtt --model=RMN \
 --result_dir=results/msr-vtt_model --use_lin_loss \
 --learning_rate_decay --learning_rate_decay_every=5 \
 --learning_rate_decay_rate=3 \
 --use_loc --use_rel --use_func --use_multi_gpu \
 --learning_rate=1e-4 --attention=gumbel \
 --hidden_size=1300 --att_size=1024 \
 --train_batch_size=32 --test_batch_size=8

The --use_multi_gpu flag in the command above trains the model with multiple GPUs.
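
The decay flags in the training command suggest a step schedule. This README does not state how the decay rate is applied, so the sketch below rests on an assumption: the learning rate is divided by `learning_rate_decay_rate` once every `learning_rate_decay_every` epochs. The function name `decayed_learning_rate` is hypothetical; check train.py for the actual rule.

```python
def decayed_learning_rate(base_lr, epoch, decay_every=5, decay_rate=3.0):
    """Learning rate at a 0-based epoch under the ASSUMED step schedule.

    Assumption: the rate is divided by decay_rate every decay_every epochs.
    """
    steps = epoch // decay_every  # number of decays applied so far
    return base_lr / (decay_rate ** steps)
```

Under this assumption, the values in the command above (1e-4, every 5 epochs, rate 3) would shrink the learning rate to one third of its previous value at epochs 5, 10, 15, and so on.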

Sampling

Sampling command example:

python sample.py --dataset=msr-vtt --model=RMN \
 --result_dir=results/msr-vtt_model \
 --use_loc --use_rel --use_func \
 --hidden_size=1300 --att_size=1024 \
 --eval_metric=CIDEr

Running this command produces the pie chart shown in the paper. Uncommenting the visualization code in sample.py lets you visualize the module selection process.

Video Captioning Papers

This repository contains a curated list of research papers on video captioning (from 2015 to 2020), with links to the code and project website where available.

Citation

If you use our code in your research or wish to refer to the baseline results, please use the following BibTeX entry.

@inproceedings{tan2020learning,
title={Learning to Discretely Compose Reasoning Module Networks for Video Captioning},
author={Tan, Ganchao and Liu, Daqing and Wang, Meng and Zha, Zheng-Jun},
booktitle={IJCAI-PRICAI},
year={2020}
}
