
Videos as Space-Time Region Graphs

Summary

  • This repository is for testing the idea of the following paper:

Wang, Xiaolong, and Abhinav Gupta. "Videos as space-time region graphs." Proceedings of the European conference on computer vision (ECCV). 2018.

  • This means it may contain several mismatches with the original implementation described in the paper.

  • Also, the performance is much lower than reported in the publication (24.5% vs. 43.3% top-1), and the Kinetics pre-trained ResNet-50-I3D has never been tested.

Notes

  • This repository is based on https://github.com/kenshohara/3D-ResNets-PyTorch.

  • The architecture of ResNet-50-I3D in the paper differs from that in the above repository. The ImageNet pre-trained model was used instead of a Kinetics pre-trained model.

  • Currently, the RPN is run on every iteration, which requires approximately 3 times more training time (a rough sketch of how per-frame region boxes can be obtained follows this list).

  • A Kinetics pre-trained model can be found here.
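
As a rough illustration of where per-frame region boxes could come from, the sketch below uses torchvision's off-the-shelf Faster R-CNN detector and keeps the top-scoring boxes per frame. This is only an assumption for illustration and is not the repo's actual RPN integration.

# Illustrative sketch only: get region boxes per frame with an off-the-shelf
# torchvision detector. The repo's actual RPN integration may differ.
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

@torch.no_grad()
def boxes_per_frame(frames, n_rois=10):
    """frames: list of (3, H, W) float tensors in [0, 1]."""
    outputs = detector(frames)  # one dict per frame with 'boxes', 'labels', 'scores'
    # torchvision returns boxes sorted by score, so slicing keeps the top n_rois.
    return [out["boxes"][:n_rois] for out in outputs]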

Requirements

  • Python 3

  • PyTorch with torchvision

  • FFmpeg, FFprobe

conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
pip install -r requirements.txt

Preparation

Kinetics

  • Download videos using the official crawler.
    • Locate test set in video_directory/test.
  • Convert the videos from mp4 to jpg files using util_scripts/generate_video_jpgs.py
python -m util_scripts.generate_video_jpgs mp4_video_dir_path jpg_video_dir_path kinetics
  • Generate annotation file in json format similar to ActivityNet using util_scripts/kinetics_json.py
    • The CSV files (kinetics_{train, val, test}.csv) are included in the crawler.
python -m util_scripts.kinetics_json csv_dir_path 700 jpg_video_dir_path jpg dst_json_path
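
The generated annotation file follows the ActivityNet-like layout from 3D-ResNets-PyTorch; roughly, it has the shape sketched below (a hedged example with made-up entries; the exact fields are whatever util_scripts/kinetics_json.py writes).

# Rough shape of the ActivityNet-like annotation file (illustrative only;
# see util_scripts/kinetics_json.py for the exact fields it writes).
annotation = {
    "labels": ["abseiling", "air drumming", "..."],  # one entry per class
    "database": {
        "some_video_id": {                           # hypothetical video id
            "subset": "training",                    # or "validation" / "testing"
            "annotations": {"label": "abseiling", "segment": [1, 300]},
        },
    },
}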

Something-Something v1/v2

  • Download videos from the official website.
  • For Something-Something v2, please run util_scripts/vid2img_sthv1.py to extract frames.
  • Generate annotation files in json format using util_scripts/sthv1_json.py and util_scripts/sthv2_json.py:
python util_scripts/sthv1_json.py 'data/something/v1' 'data/something/v1/img' 'data/sthv1.json'
python util_scripts/sthv2_json.py 'data/something/v2' 'data/something/v2/img' 'data/sthv2.json'

Running the code

Data Path

Assume the structure of data directories is the following:

~/
  data/
    something/
      v1/
        img/
          .../ (directories of video names)
            ... (jpg files)
      v2/
        img/
          .../ (directories of video names)
            ... (jpg files)
    kinetics_videos/
      jpg/
        .../ (directories of class names)
          .../ (directories of video names)
            ... (jpg files)
    results/
      save_100.pth
    kinetics.json

Confirm all options.

python main.py -h

Kinetics Pre-training

Train ResNet-50 on the Kinetics-700 dataset (700 classes) with 4 CPU threads (for data loading).
Batch size is 128.
Models are saved every 5 epochs. All GPUs are used for training. If you want to use only some of the GPUs, set CUDA_VISIBLE_DEVICES=....

python main.py --root_path ~/data --video_path kinetics_videos/jpg --annotation_path kinetics.json \
--result_path results --dataset kinetics --model resnet \
--model_depth 50 --n_classes 700 --batch_size 128 --n_threads 4 --checkpoint 5

Calculate top-5 class probabilities of each video using a trained model (~/data/results/save_200.pth).
Note that inference_batch_size should be small, because the actual batch size is inference_batch_size * (n_video_frames / inference_stride).

python main.py --root_path ~/data --video_path kinetics_videos/jpg --annotation_path kinetics.json \
--result_path results --dataset kinetics --resume_path results/save_200.pth \
--model_depth 50 --n_classes 700 --n_threads 4 --no_train --no_val --inference --output_topk 5 --inference_batch_size 1
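
For example (with purely illustrative numbers): inference_batch_size=1 on a 320-frame video with an inference stride of 16 gives an actual batch of 1 * (320 / 16) = 20 clips per forward pass.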

Evaluate top-1 video accuracy of a recognition result (data/results/val.json).

python -m util_scripts.eval_accuracy data/sthv2.json data/results/val.json --subset val -k 1 --ignore

Something-Something-v1

First, we train the backbone network (ResNet-50-I3D) for 100 epochs with a learning rate of 0.00125 (decayed to 0.000125 at epoch 90). The original batch size is 8, but in this implementation we use 32 to reduce training time.

python main.py --root_path data --video_path data/something/v1/img --annotation_path sthv1.json \
--result_path resnet_strg_imgnet_bs32 --dataset somethingv1 --n_classes 174 --n_pretrain_classes 700 \
--ft_begin_module fc --tensorboard --wandb --conv1_t_size 5 --learning_rate 0.00125 --sample_duration 32 \
--n_epochs 100 --multistep_milestones 90 --model resnet_strg --model_depth 50 --batch_size 32 \
--n_threads 8 --checkpoint 1
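
The decay described above (0.00125 to 0.000125 at epoch 90) corresponds to a standard multi-step schedule with a factor of 0.1. A minimal sketch, assuming a plain MultiStepLR (main.py's actual scheduler and optimizer settings may differ):

# Minimal sketch of the learning-rate schedule described above, assuming a
# standard MultiStepLR with gamma=0.1; main.py's actual scheduler may differ.
import torch

model = torch.nn.Linear(8, 2)  # stand-in for ResNet-50-I3D
optimizer = torch.optim.SGD(model.parameters(), lr=0.00125, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[90], gamma=0.1)

for epoch in range(100):
    # ... one training epoch would run here ...
    scheduler.step()  # lr drops to 0.000125 once epoch 90 is reached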

Then, we train with the GCN module for 30 epochs with a learning rate of 0.000125.

python main.py --root_path data --video_path data/something/v1/img --annotation_path sthv1.json \
--result_path resnet_strg_imgnet_32_gcn --dataset somethingv1 --n_classes 174 --n_pretrain_classes 174 \
--ft_begin_module fc --tensorboard --wandb --conv1_t_size 5  --learning_rate 0.000125 \
--sample_duration 32 --n_epochs 30 --model resnet_strg --model_depth 50 --batch_size 32 \
--nrois 10 --det_interval 2 --strg \
--n_threads 8 --checkpoint 1 --pretrain_path resnet_strg_imgnet_bs32/save_100.pth
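
The --strg flag enables graph convolutions over the ROI features. As a rough illustration of the similarity-graph branch described in the paper (a minimal sketch of the idea, not the repo's strg.py):

# Minimal sketch of a similarity-graph GCN step over region features,
# following the idea in the paper (not the repo's actual strg.py).
import torch
import torch.nn.functional as F

def similarity_gcn_layer(roi_feats, weight):
    # roi_feats: (N, D) features of N ROIs pooled from the clip.
    # weight:    (D, D) learnable GCN weight (here just a random stand-in).
    sim = roi_feats @ roi_feats.t()          # (N, N) pairwise similarities
    adj = F.softmax(sim, dim=-1)             # row-normalized soft adjacency
    return F.relu(adj @ roi_feats @ weight)  # Z = A X W with a nonlinearity

# Toy usage with random features (illustrative shapes only).
feats = torch.randn(10, 256)                 # e.g. nrois=10 region features
w = torch.randn(256, 256) * 0.01
out = similarity_gcn_layer(feats, w)         # (10, 256) updated region features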

Results on Something-Something-v1

The published results

Model name       ResNet-50-I3D   ResNet-50-I3D + STRG
Top-1 Accuracy   41.6%           43.3%

This repo's results (without the Kinetics pre-trained model)

Model name       ResNet-50-I3D   ResNet-50-I3D + STRG
Top-1 Accuracy   23.2%           24.5%

strg's Issues

About pre-trained model

Hi, could you provide me with your pre-trained models for the datasets noted in results? Thank you very much!

About your dataset

Hi, since I'm a beginner and could not get the Kinetics dataset, I can only read your code. I have some questions:
(1) In videodataset.py, the videodataset class returns a clip and a target during training. I notice that the length of the clip equals the length of frame_indices, which is 10, but in your paper you select 32 frames as input. Could you tell me where the 32 frames are selected?
(2) About strg.py: I tested another input size, 1*3*32*224*224, whose batch size is 1 and depth is 32, but the batch size of the output of the extractor and reducer is 2. Does that mean I must process all my input with batch_size = 4? Could I use other input sizes?
These problems are basic but have been bothering me for several days. I would be very grateful if you could help me.
Thanks.

Hi, a query about code and data ╥﹏╥...

Hi,

I think the video data should be processed by me, is that true? For example, I must extract images from the videos...

I want to understand this line of code:

self.roi_align = RoIAlign((roi_size, roi_size), 1/8, -1, aligned=True)

Why 1/8 and -1?

Hope to get your reply (●'◡'●)

Best wishes,
jun0

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.