
Temporal Context Aggregation for Video Retrieval with Contrastive Learning

By Jie Shao*, Xin Wen*, Bingchen Zhao and Xiangyang Xue (*: equal contribution)

This is the official PyTorch implementation of the paper "Temporal Context Aggregation for Video Retrieval with Contrastive Learning".

Introduction

In this paper, we propose TCA (Temporal Context Aggregation for Video Retrieval), a video representation learning framework that incorporates long-range temporal information between frame-level features using the self-attention mechanism.
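To make the aggregation idea concrete, here is a minimal numpy sketch of scaled dot-product self-attention over frame-level features (no learned projections, just Q = K = V = frames); this is an illustration of the mechanism, not the repo's actual model code:

```python
import numpy as np

def self_attention(frames):
    """Scaled dot-product self-attention over frame-level features.

    frames: (T, D) array of T frame descriptors.
    Each output row is a context-aware mixture of all frames,
    weighted by pairwise similarity.
    """
    d = frames.shape[-1]
    scores = frames @ frames.T / np.sqrt(d)         # (T, T) pairwise affinities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over frames
    return weights @ frames                         # (T, D) aggregated features

rng = np.random.default_rng(0)
frames = rng.standard_normal((300, 1024)).astype(np.float32)
out = self_attention(frames)
print(out.shape)  # (300, 1024)
```

The shapes (300 frames, 1024-dim features) mirror the `--padding_size` and `--output_dim` values used in the example commands below.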

[teaser figure]

To train it on video retrieval datasets, we propose a supervised contrastive learning method that performs automatic hard negative mining and utilizes the memory bank mechanism to increase the capacity of negative samples.
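An InfoNCE-style contrastive loss with a memory bank of negatives can be sketched as follows (a generic numpy illustration, not the repo's implementation; the bank size of 4096 and temperature of 1.0 simply mirror the `--moco_k` and `--moco_t` flags in the training command below):

```python
import numpy as np

def info_nce(query, positive, memory_bank, temperature=1.0):
    """InfoNCE-style contrastive loss with a memory bank of negatives.

    query, positive: (D,) L2-normalized video embeddings.
    memory_bank: (K, D) embeddings of past samples used as negatives.
    Lower loss when the query is closer to its positive than to any negative.
    """
    l_pos = query @ positive                         # similarity to the positive
    l_neg = memory_bank @ query                      # (K,) similarities to negatives
    logits = np.concatenate(([l_pos], l_neg)) / temperature
    logits -= logits.max()                           # numerical stability
    log_prob = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob                                 # cross-entropy, positive at index 0

rng = np.random.default_rng(0)
norm = lambda v: v / np.linalg.norm(v, axis=-1, keepdims=True)
q = norm(rng.standard_normal(128))
pos = norm(q + 0.1 * rng.standard_normal(128))       # near-duplicate of the query
bank = norm(rng.standard_normal((4096, 128)))        # memory bank of negatives
loss = info_nce(q, pos, bank)
print(loss)
```

Hard negatives in the bank (samples with high similarity to the query) dominate the denominator, which is what makes this formulation perform automatic hard negative mining.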

The proposed method shows a significant performance advantage (∼17% mAP on FIVR-200K) over state-of-the-art methods with video-level features, and delivers competitive results with 22x faster inference time compared with frame-level features.

Getting Started

Requirements

Currently, we have only tested the code's compatibility with the following dependencies:

  • Python 3.7
  • PyTorch == 1.4.0
  • Torchvision == 0.5.0
  • CUDA == 10.1
  • Other dependencies

Installation

  • Clone this repo:
git clone https://github.com/xwen99/temporal_context_aggregation.git
cd temporal_context_aggregation
  • Install the dependencies:
pip install -r requirements.txt

Preparing the Data

  • Please follow the instructions in the pre-processing folder.

Training

  • Example training script for the VCDB dataset on an 8-GPU machine:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 horovodrun -np 8 -H localhost:8 \
python train.py \
--annotation_path datasets/vcdb.pickle \
--feature_path PATH/TO/YOUR/DATASET \
--model_path PATH/TO/YOUR/MODEL \
--num_clusters 256 \
--num_layers 1 \
--output_dim 1024 \
--normalize_input \
--neg_num 16 \
--epochs 40 \
--batch_sz 64 \
--learning_rate 1e-5 \
--momentum 0.9 \
--weight_decay 1e-4 \
--pca_components 1024 \
--padding_size 300 \
--num_readers 32 \
--num_workers 1 \
--moco_k 4096 \
--moco_m 0. \
--moco_t 1.0 \
--print-freq 1 \
--use-adasum \
--fp16-allreduce

Evaluation

  • Example evaluation script on the FIVR-5K dataset:
python3 evaluation.py \
--dataset FIVR-5K \
--pca_components 1024 \
--num_clusters 256 \
--num_layers 1 \
--output_dim 1024 \
--padding_size 300 \
--metric cosine \
--model_path PATH/TO/YOUR/MODEL \
--feature_path PATH/TO/YOUR/DATASET
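With `--metric cosine`, retrieval amounts to ranking database videos by cosine similarity to each query embedding. A minimal numpy sketch of that ranking step (a generic illustration, not the repo's evaluation code):

```python
import numpy as np

def cosine_retrieval(queries, database):
    """Rank database videos by cosine similarity to each query.

    queries: (Q, D) and database: (N, D) video-level embeddings.
    Returns (Q, N) database indices, most similar first.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = q @ db.T                    # (Q, N) cosine similarities
    return np.argsort(-sims, axis=1)   # indices in descending similarity

rng = np.random.default_rng(0)
db = rng.standard_normal((100, 1024))
queries = db[:3] + 0.01 * rng.standard_normal((3, 1024))  # perturbed copies
ranks = cosine_retrieval(queries, db)
print(ranks[:, 0])  # each query's nearest neighbor in the database
```

Since the queries are near-copies of the first three database entries, each query's top-ranked result is its own source video.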

Acknowledgement

Our codebase builds upon several existing publicly available projects, which we have modified and integrated into this one.

Citation

@InProceedings{Shao_2021_WACV,
    author    = {Shao, Jie and Wen, Xin and Zhao, Bingchen and Xue, Xiangyang},
    title     = {Temporal Context Aggregation for Video Retrieval With Contrastive Learning},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2021},
    pages     = {3268-3278}
}

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Contact

Xin Wen: [email protected]


Issues

Computer with only one GPU

Hi, thanks for your work! In your paper you used 8 GPUs to train the network, but I have only one GPU on my computer. How can I complete the training process with your code, i.e., how can I run it without using the horovod module? Also, I'd like to know whether the training code can be run on Windows.

Could you open-source the model file?

Hi, thanks for your work! We have tried to train the model based on your training code, but we could not achieve the same metrics as in your paper. Could you open-source your model file or give us more instructions? Thanks a lot!

Questions about evaluation

Hello, I'm trying to reproduce the code after reading the paper, and I have a few questions about the evaluation. The FeatureDataset class in data.py is called from evaluation.py, but this part is missing; can you share it with me? Also, what exactly does the padding size mean during evaluation, and what value should I set it to? Thank you.

Cannot achieve the score written in the paper

Hello. I've been trying to reproduce the paper's score, but I failed to achieve the same metrics reported in your paper.

Here are the differences between TCA and my attempts:

  1. My copy of the VCDB background dataset is slightly different from yours, so about 80 videos are missing from my extracted VCDB features.
  2. There is no information about how frames are randomly sampled when training PCA. Is the value of 10 frames written in the GitHub pre_processing code correct?
  3. In the paper, the Transformer model's dropout rate is set to 0.5, but in train.py it is set to 0.2.
  4. In evaluation.py, the cosine similarity is not working (it comes from someone else's code), so I used my own cosine-similarity calculation.

Here are the results of measuring performance based on the settings in the paper:

python3 evaluation.py --dataset FIVR-5K --pca_components 1024 --num_clusters 256 --num_layers 1 --output_dim 1024 --padding_size 300 --metric sym_chamfer --model_path models/model_v5_with_all_bg.pth --feature_path pre_processing/fivr_imac_pca1024.hdf5

===== FIVR-5K Dataset =====
Queries: 50 videos
Database: 5049 videos
----------------
DSVR mAP: 0.8029
CSVR mAP: 0.7893
ISVR mAP: 0.7040
python evaluation_org.py --dataset FIVR-5K --pca_components 1024 --num_cluster 256  --num_layer 1 --output_dim 1024 --padding_size 300  --metric cosine --model_path models/model_v5_with_all_bg.pth  --feature_path pre_processing/fivr_imac_pca1024.hdf5 --random_sampling

========================== mAP ==========================

        mAP@1      mAP@10     mAP@100    mAP@200      mAP
----  ---------  ---------  ---------  ---------  ---------
DSVR     0.9400     0.9230     0.7731     0.7382     0.5761
CSVR     0.9400     0.9339     0.7828     0.7414     0.5618
ISVR     0.9800     0.9701     0.8087     0.7525     0.4970

I really wanted to get the same metrics as in your paper. Please let me know what differs from your setup. Thanks a lot.
