3D dense residual network for action recognition
Limited by hardware (I only have one GTX 1080 Ti) and network conditions (CN), I did not run further experiments on large datasets, e.g. Kinetics or Sports-1M.
Inspired by "Residual Dense Network for Image Super-Resolution".
Fig. 1: 3D dense residual block
Fig. 2: 3D dense residual network
The parameter count and model size of 3D-DRN are as follows.
parameters | model size |
---|---|
1.5M | 6.3MB |
- OpenCV 3.2
- Keras 2.0.8
- TensorFlow 1.3
step1 -- download the UCF-101 dataset
step2 -- converting videos to images for UCF-101
```shell
python utils/video2img.py --video-path='the path of ucf101' --save-path='the path for saving images'
```
step3 -- generating label txt for converted images
```shell
python utils/make_label_txt.py --image-path='the path of saved images'
```
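A minimal sketch of what such a label script might do, assuming frames are saved as `<image-path>/<ClassName>/<video>/frame.jpg` and each output line pairs a video directory with a class index. The layout and the output format are assumptions for illustration, not the actual behavior of `utils/make_label_txt.py`:

```python
import os

def make_label_txt(image_path, out_file):
    # Assumed layout: <image_path>/<ClassName>/<video_dir>/*.jpg
    # Writes one line per video directory: "<video_dir> <class_id>".
    classes = sorted(os.listdir(image_path))
    with open(out_file, "w") as f:
        for label, cls in enumerate(classes):
            cls_dir = os.path.join(image_path, cls)
            for video in sorted(os.listdir(cls_dir)):
                f.write("%s %d\n" % (os.path.join(cls_dir, video), label))
```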
In C3D, the input dimensions are 128 × 171 × 16 × 3; in this repo they are 128 × 171 × 8 × 3.
During training, three types of input clip length are supported. Check this script for details.
(1) clip length = 16. I take one frame every two frames.
(2) clip length = 24. I take one frame every three frames.
(3) mixed clip lengths. I first choose a clip length of 16 or 24 with 50% probability each, then take one frame every two or three frames accordingly.
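The three sampling schemes above can be sketched as follows. The function name and its arguments mirror the `--clip-length`/`--random-length` training flags but are illustrative, not the repo's actual code; eight frames are kept in every case, matching the 128 × 171 × 8 × 3 input:

```python
import random

def sample_clip_indices(num_frames, clip_length=16, random_length=False):
    # Mixed mode: pick a clip length of 16 or 24 with 50% probability each.
    if random_length:
        clip_length = random.choice([16, 24])
    stride = clip_length // 8          # 16 -> every 2nd frame, 24 -> every 3rd
    # Random temporal offset so the clip fits inside the video.
    start = random.randint(0, max(0, num_frames - clip_length))
    return [start + i * stride for i in range(8)]
```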
Clips are resized to a frame size of 128 × 171. During training, I randomly crop input clips to 112 × 112 × 8 for spatial and temporal jittering, and also horizontally flip them with 50% probability.
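A NumPy sketch of this crop-and-flip augmentation, assuming a clip is an array of shape (frames, height, width, channels) such as (8, 128, 171, 3); the function name is illustrative, not the repo's actual code:

```python
import numpy as np

def spatial_jitter(clip, crop_size=112):
    t, h, w, c = clip.shape
    # Random top-left corner for a crop_size x crop_size spatial crop.
    y = np.random.randint(0, h - crop_size + 1)
    x = np.random.randint(0, w - crop_size + 1)
    out = clip[:, y:y + crop_size, x:x + crop_size, :]
    # Horizontal flip with 50% probability (reverse the width axis).
    if np.random.rand() < 0.5:
        out = out[:, :, ::-1, :]
    return out
```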
```shell
python train_DRN-3D.py --lr=0.01 --batch-size=16 --drop-rate=0.2 --clip-length=16 --random-length=False --image-path='the path of saved images'
```
I use only a single center crop per clip and pass it through the network to obtain the clip prediction. For video predictions, I average the predictions of several clips extracted evenly from the video (no overlap).
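The averaging step can be sketched as follows, assuming each clip produces a softmax probability vector (illustrative only, not the repo's exact code):

```python
import numpy as np

def predict_video(clip_probs):
    # clip_probs: (num_clips, num_classes) softmax outputs, one row per clip.
    # The video score is the mean over clips; the prediction is its argmax.
    video_prob = np.mean(np.asarray(clip_probs), axis=0)
    return int(np.argmax(video_prob))
```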
Evaluate videos (pre-trained weight files are in the 'results' directory):
```shell
python evaluate_video.py
```
Results on UCF101
clip length | clip acc | video acc |
---|---|---|
16 | 58.41% | 62.80% |
24 | 59.47% | 64.16% |
16, 24 mixed | 59.60% | 64.76% |
Fig. 3: clip accuracy (clip length = 16) during training
Fig. 4: clip loss (clip length = 16) during training
```shell
python video_demo.py
```
Extract video features for HMDB51 with the pre-trained model
First, convert videos to images:
```shell
python utils/video2img_hmdb.py
```
Second, generate the label txt:
```shell
python utils/hmdb_label.py
```
Extract video features and evaluate them:
```shell
python evaluate_hmdb.py
```
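One common way to turn per-clip features into a single video descriptor before a linear classifier on HMDB51 is mean pooling followed by L2 normalization. This is a hedged sketch of that idea, not necessarily what `evaluate_hmdb.py` does:

```python
import numpy as np

def video_feature(clip_features):
    # Mean-pool per-clip feature vectors into one fixed-length descriptor,
    # then L2-normalize it (an assumed choice, not confirmed from the repo).
    feat = np.mean(np.asarray(clip_features, dtype=np.float64), axis=0)
    norm = np.linalg.norm(feat)
    return feat / norm if norm > 0 else feat
```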