Self-Supervised Temporal-Discriminative Representation Learning

The source code for our paper

"Self-Supervised Temporal-Discriminative Representation Learning for Video Action Recognition" paper

Overview

Without one label available, our method learn to focus on motion region powerful!

Our self-supervised VTDL signifcantly outperforms existing self-supervised learning method in video action recognition, even achieve better result than fully-supervised methods on UCF101 and HMDB51 when a small-scale video dataset (with only thousands of videos) is used for pre-training!

Requirements

Python3
pytorch1.1+
PIL

Structure

datasets
- list
  - hmdb51: the train/val lists of HMDB51
  - ucf101: the train/val lists of UCF101
  - kinetics-400: the train/val lists of kinetics-400
experiments
- logs: experiments record in detials
- TemporalDis
  - hmdb51
  - ucf101
  - kinetics
- gradientes:
- visualization
src
- data: load data
- loss: the loss evluate in this paper
- model: network architectures
- scripts: train/eval scripts
- TC: detail implementation of Spatio-temporal consistency
- utils
- feature_extract.py
- main.py
- trainer.py
- option.py

Dataset

Look dataset.md. Prepare dataset in txt file, and each row of txt is as below: The split of hmdb51/ucf101/kinetics-400 can be download from google driver.

Each item include

video_path class frames_num

VTDL

Network Architecture

The network is in the folder src/model/[backbone].py

Method	#logits_channel
C3D	512
R2P1D	2048
I3D	1024
R3D	2048

Step1: self-supervised learning

HMDB51

bash scripts/TemporalDisc/hmdb51.sh

UCF101

bash scripts/TemporalDisc/ucf101.sh

Kinetics-400

bash scripts/TemporalDisc/kinetics.sh

Notice: More Training Options and ablation study Can be find in scripts

Step2: Transfer to action recognition

HMDB51

#!/usr/bin/env bash
python main.py \
--method ft \
--train_list ../datasets/lists/hmdb51/hmdb51_rgb_train_split_1.txt \
--val_list ../datasets/lists/hmdb51/hmdb51_rgb_val_split_1.txt \
--dataset hmdb51 \
--arch i3d \
--mode rgb \
--lr 0.001 \
--lr_steps 10 20 25 30 35 40 \
--epochs 45 \
--batch_size 4 \
--data_length 64 \
--workers 8 \
--dropout 0.5 \
--gpus 2 \
--logs_path ../experiments/logs/hmdb51_i3d_ft \
--print-freq 100 \
--weights ../experiments/TemporalDis/hmdb51/models/04-16-2328_aug_CJ/ckpt_epoch_48.pth

UCF101

#!/usr/bin/env bash
python main.py \
--method ft \
--train_list ../datasets/lists/ucf101/ucf101_rgb_train_split_1.txt \
--val_list ../datasets/lists/ucf101/ucf101_rgb_val_split_1.txt \
--dataset ucf101 \
--arch i3d \
--mode rgb \
--lr 0.0005 \
--lr_steps 10 20 25 30 35 40 \
--epochs 45 \
--batch_size 4 \
--data_length 64 \
--workers 8 \
--dropout 0.5 \
--gpus 2 \
--logs_path ../experiments/logs/ucf101_i3d_ft \
--print-freq 100 \
--weights ../experiments/TemporalDis/ucf101/models/04-18-2208_aug_CJ/ckpt_epoch_45.pth

Notice: More Training Options and ablation study Can be find in scripts

Results

Step2:Transfer

With same experiment setting, the result is reported below:

Method	UCF101	HMDB51
Baseline	60.3	22.6
+ BA	63.3	26.2
+ Temporal Discriminative	72.7	41.2
+ TCA	82.3	52.9

trained models/logs/performance

We provided trained models/logs/performance in google driver.

Baseline + BA

performance;

trained_model;

logs

Baseline + BA + Temporal Discriminative

performance;

trained_model;

logs

Baseline + BA + Temporal Discriminative + TCA

(a). Pretrain

Loss curve:

Ins Prob:

pretrained_weight

This pretrained model can achieve 52.7% on HMDB51.

(b). Finetune

performance;

trained_model;

logs

The result is report with single video clip. In the test, we will average ten clips as final predictions. Will lead to around 2-3% improvement.

python test.py

Feature Extractor

As STCR can be easily extend to other video representation task, we offer the scripts to perform feature extract.

python feature_extractor.py

The feature will be saved as a single numpy file in the format [video_nums,features_dim]

Citation

Please cite our paper if you find this code useful for your research.

@Article{wang2020self,
  author  = {Jinpeng Wang and Yiqi Lin and Andy J. Ma and Pong C. Yuen},
  title   = {Self-supervised Temporal Discriminative Learning for Video Representation Learning},
  journal = {arXiv preprint arXiv:2008.02129},
  year    = {2020},
}

Others

The project is partly based on Unsupervised Embedding Learning and MOCO.

fingerrec / self-supervised-temporal-discriminative-representation-learning-for-video-action-recognition Goto Github PK