

Long-term Temporal Convolutions (LTC)

GĂźl Varol, Ivan Laptev and Cordelia Schmid, Long-term Temporal Convolutions for Action Recognition, PAMI 2018.

[Project page] [arXiv]

Preparation

1. Install Torch with cuDNN support.

2. Download UCF101 and/or HMDB51 datasets.

3. Data pre-processing.

Find a few pre-processing scripts under the preprocessing/ directory. The contents of the datasets directory are explained here.

preprocess_flow.lua and preprocess_rgb.lua create the .t7 files. This was done to accelerate data loading at a time when libraries such as torch-opencv did not exist. You can now modify the loading function to read directly from .mp4 files as done here. We no longer recommend this pre-processing.
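
If you want to check what a pre-processed .t7 file contains, a minimal inspection sketch is below; the file name and the assumption that the file holds a single tensor are hypothetical, not the repo's exact format.

require 'torch'
local clip = torch.load('v_ApplyEyeMakeup_g01_c01.t7')  -- hypothetical file name
print(clip:size())            -- e.g. nChannels x nFrames x height x width (layout assumed)
print(clip:min(), clip:max()) -- value range after pre-processing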

generate_sliding_test_clips.m creates a list of sliding test windows; the files it produces are somewhat redundant.
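
As a rough illustration, the sliding-window enumeration that such a list encodes might look like the Lua sketch below; the clip length and stride values are example assumptions, not the repo's defaults.

local nVideoFrames, clipLength, stride = 210, 100, 25  -- example values (assumed)
local windowStarts = {}
for t = 1, nVideoFrames - clipLength + 1, stride do
  table.insert(windowStarts, t)  -- first frame index of each test clip
end
print(#windowStarts .. ' sliding windows')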

At the time, we extracted the Brox flow with this code. It produces the *_minmax.txt files, which store the min/max flow values so that the original flow values can be recovered while still using jpeg compression.
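
Going back from a quantized flow jpeg to real flow values plausibly inverts a min/max normalization, as in the sketch below; the file name, variable names, and exact normalization are assumptions.

require 'image'
local flowJpg = image.load('flow_x_00001.jpg', 1, 'byte'):float()  -- quantized values in [0, 255]
local minFlow, maxFlow = -12.3, 14.7                               -- example values read from *_minmax.txt
local flow = flowJpg:div(255):mul(maxFlow - minFlow):add(minFlow)  -- back to the original range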

4. C3D model in Torch.

See the c3dtorch directory for the scripts used to convert the C3D model from Caffe to Torch, as well as to convert the mean files.

The c3d.t7 (305MB) model file is produced by running convert_c3d_caffe2torch.lua with the modified version of loadcaffe that is provided. Special thanks to Sergey Zagoruyko for the help.

The file sport1m_train16_128_mean.binaryproto is converted to data/volmean_sports1m.t7 using the caffemodel2json tool from Vadim Kantorov. create_mean_files.lua further creates the mean files used for the 58x58 and 71x71 resolutions. These means are subtracted from each input frame.
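
The mean subtraction amounts to the sketch below; the assumption that the mean file holds a single tensor of the same shape as an input clip is ours.

local mean = torch.load('data/volmean_sports1m.t7'):double()  -- assumed to be a single tensor
local input = torch.rand(mean:size())                         -- stand-in for a real clip
input:add(-1, mean)                                           -- voxel-wise mean subtraction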

Scripts used to prepare experiments at different spatial (58x58, 71x71) and temporal (20, 40, 60, 80, 100 frames) resolutions are also provided for convenience. Additional layers on top of C3D were trained on UCF101 with 16-frame input. These layers were then attached to the pre-trained C3D, with the fc6 layer modified to match the new input size (scripts convert_c3d_varyingtemp_ucf_58.lua and convert_c3d_varyingtemp_ucf_71.lua). Finally, these networks were fine-tuned end-to-end at each resolution to obtain the RGB-stream results on the UCF101 dataset. A hedged sketch of the fc6 modification follows.
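
The fc6 modification might look roughly like the sketch below. The layer lookup, the new input size, and the assumption that fc6 is the first nn.Linear module are all hypothetical, not taken from the repo's scripts.

require 'nn'
local net = torch.load('c3d.t7')
local newInputSize = 8192                    -- assumed flattened conv-output size at the new resolution
for i, m in ipairs(net.modules) do
  if torch.type(m) == 'nn.Linear' then       -- assumption: the first Linear layer is fc6
    net.modules[i] = nn.Linear(newInputSize, m.weight:size(1))  -- keep the original output width
    break
  end
end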

Running the code

You can simply run th main.lua to start training with the default parameters. The following examples show how to set the parameters in different scenarios:

#### From scratch experiments on UCF101
# Run with default parameters (UCF101 dataset, split 1, 100-frame 58x58 resolution flow network with 0.9 dropout)
th  main.lua -expName flow_100f_d9

# Continue training from epoch 10
th  main.lua -expName flow_100f_d9 -continue -epochNumber 10

# Test final prediction accuracy for model number 20
th  main.lua -expName flow_100f_d9 -evaluate -modelNo 20

# Train 100-frame RGB network from scratch on UCF101 dataset
th  main.lua -nFrames 100 -loadHeight 67  -loadWidth 89  -sampleHeight 58  -sampleWidth 58  \
-stream rgb  -expName rgb_100f_d5  -dataset UCF101 -dropout 0.5 -LRfile LR/UCF101/rgb_d5.lua

# Train 71x71 spatial resolution flow network
th  main.lua -nFrames 100 -loadHeight 81  -loadWidth 108 -sampleHeight 71  -sampleWidth 71  \
-stream flow -expName flow_100f_d5 -dataset UCF101 -dropout 0.5 -LRfile LR/UCF101/flow_d5.lua

# Train 16-frame 112x112 spatial resolution flow network
th  main.lua -nFrames 16  -loadHeight 128 -loadWidth 171 -sampleHeight 112 -sampleWidth 112 \
-stream flow -expName flow_16f_d5 -dataset UCF101 -dropout 0.5 -LRfile LR/UCF101/flow_d5.lua

#### Fine-tune HMDB51 from UCF101
# Train the last layer and freeze the lower layers
th main.lua -expName flow_100f_58_d9/finetune/last             \
-loadHeight 67 -loadWidth 89 -sampleHeight 58 -sampleWidth 58  \
-dataset HMDB51                                                \
-LRfile LR/HMDB51/flow_d9_last.lua                             \
-finetune last                                                 \
-retrain log/UCF101/flow_100f_58_d9/model_50.t7

# Fine-tune the whole network
th main.lua -expName flow_100f_58_d9/finetune/whole            \
-loadHeight 67 -loadWidth 89 -sampleHeight 58 -sampleWidth 58  \
-dataset HMDB51                                                \
-LRfile LR/HMDB51/flow_d9_whole.lua                            \
-finetune whole                                                \
-lastlayer log/HMDB51/flow_100f_58_d9/finetune/last/model_3.t7 \
-retrain log/UCF101/flow_100f_58_d9/model_50.t7

Note that the results are sensitive to the learning-rate (LR) schedule. You can set your own schedule by writing an -LRfile; a hypothetical sketch follows the list below. A few observations that may be useful:

  • RGB networks converge faster than flow networks.
  • High dropout takes longer to converge.
  • The HMDB51 dataset trains faster.
  • Networks with fewer frames train faster.
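
Since this code is built on imagenet-multiGPU.torch, an -LRfile plausibly resembles that project's regimes table, as in the hypothetical sketch below; the exact format the code expects may differ.

-- {firstEpoch, lastEpoch, learningRate, weightDecay} (assumed format)
return {
  {  1,  20, 1e-2, 5e-4 },
  { 21,  40, 1e-3, 5e-4 },
  { 41, 100, 1e-4, 0    },
}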

Pre-trained models

We provide the 71x71 RGB networks used for the final results (the 60f and 100f models), together with the 16f initialization for convenience. You can find the download links in models/download_pretrained_rgb_models.sh. See here for the mean files. If you need other resolutions, please send an e-mail.
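
A hedged sketch of loading a downloaded model and running a forward pass is below; the file name and input shape are assumptions, and depending on how the model was saved you may additionally need require 'cunn' or 'cudnn'.

require 'nn'
local net = torch.load('model_rgb_100f.t7')  -- hypothetical file name
net:evaluate()
local input = torch.rand(1, 3, 100, 71, 71)  -- batch x channels x frames x height x width (assumed)
print(net:forward(input):size())             -- per-class scores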

IDT features

You can find the Fisher Vector encodings of the improved dense trajectory (IDT) features under the IDT/ directory.
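
If you want to combine these with LTC predictions, a simple score-level fusion sketch follows; the file names and the equal weighting are assumptions, and the paper's exact fusion scheme may differ.

local ltcScores = torch.load('scores_ltc.t7')  -- hypothetical: nVideos x nClasses
local idtScores = torch.load('scores_idt.t7')  -- hypothetical file name
local fused = (ltcScores + idtScores) / 2      -- equal-weight average of the two streams
local _, pred = fused:max(2)                   -- predicted class index per video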

Citation

If you use this code, please cite the following:

@ARTICLE{varol18_ltc,  
  title   = {Long-term Temporal Convolutions for Action Recognition},  
  author  = {Varol, G{\"u}l and Laptev, Ivan and Schmid, Cordelia},  
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},  
  year    = {2018},  
  volume  = {40},  
  number  = {6},  
  pages   = {1510--1517},  
  doi     = {10.1109/TPAMI.2017.2712608}  
}

Acknowledgements

This code is largely built on the ImageNet training example https://github.com/soumith/imagenet-multiGPU.torch by Soumith Chintala.


ltc's Issues

Error while running main.lua

Hello,
Your work is detailed and the effort you have put into it is awesome. I am trying to reproduce this work for a mini project. I am experimenting on the UCF101 dataset and have generated everything up to the .t7 files for flow.

I am unclear about how the splits work, and I repeatedly hit an error while running main.lua. I am relatively new to the Lua/Torch environment, so it would be great if you could help me with this issue.

I have attached a screenshot for reference.

Thanks in advance

The input channel order

I am not sure about the input channel order. The input size is 3x16x112x112 (three channels), but should the channel order be RGB or BGR?

convert_model_caffe_torch

Thanks for your nice code and wonderful work! @gulvarol
I noticed that you converted the C3D Caffe model into a Torch model. Could you please share the code that converts Torch to Caffe, as well as the code that converts Caffe to Torch?
Thank you in advance!

Pre-trained model on Sports-1M

@gulvarol
Thanks for your amazing work; it greatly improves action recognition performance. Can you publish your model pre-trained on Sports-1M? It would be a great help. Many thanks!

Issue with Dataset Directory

Hi, I am trying to run your model to understand it in more detail, but I am facing an issue with the dataset directory. I have generated the train/test splits by executing the Matlab script (splits.m). Now I want to execute the preprocess_flow.lua and preprocess_rgb.lua scripts, but the base directory mentioned in these scripts is not in the datasets folder. Please guide me on how I can get '/home/gvarol/datasets/UCF101/flow/jpg/' and '/home/gvarol/datasets/UCF101/rgb/jpg/'.

Extraction of flow-based features

Hi, I am trying to understand your code line by line, but I have not worked with Lua/Torch before, so I am facing some difficulties executing it. I am creating my model in TensorFlow and seeking some pre-processing guidance from your work. Like you, I have extracted flow-x and flow-y using the Brox algorithm code that you shared. The obtained flow-x and flow-y values, originally in a negative-to-positive range, were shifted to 0-255. But when I subtract the mean values from each frame (as you describe in your work), I get negative values again. My concern is that the CNN uses ReLU as its activation function, which zeroes out negative values, so after all the pre-processing I expected my flow features to be positive only; yet after mean subtraction I get negative values for flow-x and flow-y, which I stored as '.npy'. Can you please share the range of values in your final flow features before they are passed to the CNN? (I have no experience with Lua/Torch and don't know how to see the contents of the t7 files.)

Need Network architecture explanation

I am new to this field and would like an explanation of the network architecture used in the paper. The paper describes the architecture, but not in a way that is completely understandable to me.

What do C1...Ck mean in the image below (taken from the paper)? What does the yellow arrow signify, and how do C1...Ck get reduced in each subsequent layer?

[figure from the paper]

Can you help me understand how a given video is convolved through the network, what the dimensions of the feature maps are, and some other details of the architecture?

I seek an explanation rather than the code.

Splitting channels with t frames

For example, if I have a 210-frame video and t = 20, then each input is of size 58x58x20, but the last one will be of size 58x58x10. Could you tell me how this case was handled?

What does the paper mean by different temporal extents being used on the 60f network?

I was reading the paper Long-term Temporal Convolutions for Action Recognition and saw that different temporal extents t ∈ {20, 40, 60, 80, 100} were tried on the 60f network.

I don't understand the term "temporal extent" used here. Can you also explain what a 60f network means?

From this link I learned that a video is made up of many clips, each consisting of some number x of frames. Does that hold true in this paper too?
