
Temporal Context Aggregation for Video Retrieval with Contrastive Learning

By Jie Shao*, Xin Wen*, Bingchen Zhao and Xiangyang Xue (*: equal contribution)

This is the official PyTorch implementation of the paper "Temporal Context Aggregation for Video Retrieval with Contrastive Learning".

Introduction

In this paper, we propose TCA (Temporal Context Aggregation for Video Retrieval), a video representation learning framework that incorporates long-range temporal information between frame-level features using the self-attention mechanism.
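To make the aggregation idea concrete, here is a minimal numpy sketch of scaled dot-product self-attention over frame-level features (no learned projections, just Q = K = V = frames); this is an illustration of the mechanism, not the repo's actual model code:

```python
import numpy as np

def self_attention(frames):
    """Scaled dot-product self-attention over frame-level features.

    frames: (T, D) array of T frame descriptors.
    Each output row is a context-aware mixture of all frames,
    weighted by pairwise similarity.
    """
    d = frames.shape[-1]
    scores = frames @ frames.T / np.sqrt(d)         # (T, T) pairwise affinities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over frames
    return weights @ frames                         # (T, D) aggregated features

rng = np.random.default_rng(0)
frames = rng.standard_normal((300, 1024)).astype(np.float32)
out = self_attention(frames)
print(out.shape)  # (300, 1024)
```

The shapes (300 frames, 1024-dim features) mirror the `--padding_size` and `--output_dim` values used in the example commands below.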

[teaser figure]

To train it on video retrieval datasets, we propose a supervised contrastive learning method that performs automatic hard negative mining and utilizes the memory bank mechanism to increase the capacity of negative samples.
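An InfoNCE-style contrastive loss with a memory bank of negatives can be sketched as follows (a generic numpy illustration, not the repo's implementation; the bank size of 4096 and temperature of 1.0 simply mirror the `--moco_k` and `--moco_t` flags in the training command below):

```python
import numpy as np

def info_nce(query, positive, memory_bank, temperature=1.0):
    """InfoNCE-style contrastive loss with a memory bank of negatives.

    query, positive: (D,) L2-normalized video embeddings.
    memory_bank: (K, D) embeddings of past samples used as negatives.
    Lower loss when the query is closer to its positive than to any negative.
    """
    l_pos = query @ positive                         # similarity to the positive
    l_neg = memory_bank @ query                      # (K,) similarities to negatives
    logits = np.concatenate(([l_pos], l_neg)) / temperature
    logits -= logits.max()                           # numerical stability
    log_prob = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob                                 # cross-entropy, positive at index 0

rng = np.random.default_rng(0)
norm = lambda v: v / np.linalg.norm(v, axis=-1, keepdims=True)
q = norm(rng.standard_normal(128))
pos = norm(q + 0.1 * rng.standard_normal(128))       # near-duplicate of the query
bank = norm(rng.standard_normal((4096, 128)))        # memory bank of negatives
loss = info_nce(q, pos, bank)
print(loss)
```

Hard negatives in the bank (samples with high similarity to the query) dominate the denominator, which is what makes this formulation perform automatic hard negative mining.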

The proposed method shows a significant performance advantage (∼17% mAP on FIVR-200K) over state-of-the-art methods with video-level features, and delivers competitive results with 22x faster inference time compared with frame-level features.

Getting Started

Requirements

Currently, we have only tested the code's compatibility with the following dependencies:

  • Python 3.7
  • PyTorch == 1.4.0
  • Torchvision == 0.5.0
  • CUDA == 10.1
  • Other dependencies

Installation

  • Clone this repo:
git clone https://github.com/xwen99/temporal_context_aggregation.git
cd temporal_context_aggregation
  • Install the dependencies:
pip install -r requirements.txt

Preparing the Data

  • Please follow the instructions in the pre-processing folder.

Training

  • Example training script for the VCDB dataset on an 8-GPU machine:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 horovodrun -np 8 -H localhost:8 \
python train.py \
--annotation_path datasets/vcdb.pickle \
--feature_path PATH/TO/YOUR/DATASET \
--model_path PATH/TO/YOUR/MODEL \
--num_clusters 256 \
--num_layers 1 \
--output_dim 1024 \
--normalize_input \
--neg_num 16 \
--epochs 40 \
--batch_sz 64 \
--learning_rate 1e-5 \
--momentum 0.9 \
--weight_decay 1e-4 \
--pca_components 1024 \
--padding_size 300 \
--num_readers 32 \
--num_workers 1 \
--moco_k 4096 \
--moco_m 0. \
--moco_t 1.0 \
--print-freq 1 \
--use-adasum \
--fp16-allreduce

Evaluation

  • Example evaluation script on the FIVR-5K dataset:
python3 evaluation.py \
--dataset FIVR-5K \
--pca_components 1024 \
--num_clusters 256 \
--num_layers 1 \
--output_dim 1024 \
--padding_size 300 \
--metric cosine \
--model_path PATH/TO/YOUR/MODEL \
--feature_path PATH/TO/YOUR/DATASET
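With `--metric cosine`, retrieval amounts to ranking database videos by cosine similarity to each query embedding. A minimal numpy sketch of that ranking step (a generic illustration, not the repo's evaluation code):

```python
import numpy as np

def cosine_retrieval(queries, database):
    """Rank database videos by cosine similarity to each query.

    queries: (Q, D) and database: (N, D) video-level embeddings.
    Returns (Q, N) database indices, most similar first.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = q @ db.T                    # (Q, N) cosine similarities
    return np.argsort(-sims, axis=1)   # indices in descending similarity

rng = np.random.default_rng(0)
db = rng.standard_normal((100, 1024))
queries = db[:3] + 0.01 * rng.standard_normal((3, 1024))  # perturbed copies
ranks = cosine_retrieval(queries, db)
print(ranks[:, 0])  # each query's nearest neighbor in the database
```

Since the queries are near-copies of the first three database entries, each query's top-ranked result is its own source video.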

Acknowledgement

Our codebase builds upon several existing publicly available projects, which we have modified and integrated into this one.

Citation

@InProceedings{Shao_2021_WACV,
    author    = {Shao, Jie and Wen, Xin and Zhao, Bingchen and Xue, Xiangyang},
    title     = {Temporal Context Aggregation for Video Retrieval With Contrastive Learning},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2021},
    pages     = {3268-3278}
}

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Contact

Xin Wen: [email protected]


Issues

Computer with only one GPU

Hi, thanks for your work! In your paper you used 8 GPUs to train the network, but I have only one GPU on my computer. How can I complete the training process with your code, i.e., how can I run it without using the horovod module? Also, I'd like to know whether the training code can be run on Windows.

Could you open-source the model file?

Hi, thanks for your work! We have tried to train the model based on your training code, but we could not achieve the same metrics as in your paper. Could you open-source your model file or give us more instructions? Thanks a lot!

Questions about evaluation

Hello, I'm trying to reproduce the code after reading the paper, and I have a few questions about the evaluation. The FeatureDataset class in data.py is called from evaluation.py, but this part is missing; can you share it with me? Also, what exactly does the padding size mean during evaluation, and what value should I set it to? Thank you.

Cannot achieve the score written in the paper

Hello. I've been trying to reproduce the paper's score, but I failed to achieve the same metrics reported in your paper.

Here are the differences between TCA and my attempts:

  1. My copy of the VCDB background dataset is slightly different from yours, so about 80 videos are missing from my extracted VCDB features.
  2. There is no information about how frames are randomly sampled when training PCA. Is the value of 10 frames written in the GitHub pre_processing code correct?
  3. In the paper, the Transformer model's dropout rate is set to 0.5, but in train.py it is set to 0.2.
  4. In evaluation.py, the cosine similarity is not working (it comes from someone else's code), so I used my own cosine-similarity calculation.

Here are the results of measuring performance based on the settings in the paper:

python3 evaluation.py --dataset FIVR-5K --pca_components 1024 --num_clusters 256 --num_layers 1 --output_dim 1024 --padding_size 300 --metric sym_chamfer --model_path models/model_v5_with_all_bg.pth --feature_path pre_processing/fivr_imac_pca1024.hdf5

===== FIVR-5K Dataset =====
Queries: 50 videos
Database: 5049 videos
----------------
DSVR mAP: 0.8029
CSVR mAP: 0.7893
ISVR mAP: 0.7040
python evaluation_org.py --dataset FIVR-5K --pca_components 1024 --num_cluster 256  --num_layer 1 --output_dim 1024 --padding_size 300  --metric cosine --model_path models/model_v5_with_all_bg.pth  --feature_path pre_processing/fivr_imac_pca1024.hdf5 --random_sampling

========================== mAP ==========================

        mAP@1      mAP@10     mAP@100    mAP@200      mAP
----  ---------  ---------  ---------  ---------  ---------
DSVR     0.9400     0.9230     0.7731     0.7382     0.5761
CSVR     0.9400     0.9339     0.7828     0.7414     0.5618
ISVR     0.9800     0.9701     0.8087     0.7525     0.4970

I really wanted to get the same metrics as in your paper. Please let me know what differs from your setup. Thanks a lot.
