Performing video classification using the predictive processing architecture. The model is trained in a self-supervised manner to predict future frames in videos, alongside the supervised video action classification task.
Build a benchmark NN architecture for the 20BN Something-Something-v2 dataset.
Task: Fine-grained action classification and video captioning.
Test: Transfer-learning ability on a kitchen-action dataset, improved thanks to the fine-grained labels.
Dataset -
220,000 videos
174 fine-grained actions
frame rate - 12fps
Pre-processing - (a) if the number of frames < 48, replicate the first and last frames; (b) resize the frames to 128×128; (c) random cropping of size 96×96 (for testing, use a center crop instead)
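The three pre-processing steps can be sketched as below (a minimal numpy sketch; how the replicated frames are split between the front and back of a short clip is an assumption, since the notes only say "replicate the first and last frames"):

```python
import numpy as np

def pad_frames(frames, target=48):
    """Step (a): replicate the first and last frames so a short clip
    reaches `target` frames. The even front/back split is an assumption."""
    n = len(frames)
    if n >= target:
        return list(frames[:target])
    deficit = target - n
    front = deficit // 2
    back = deficit - front
    return [frames[0]] * front + list(frames) + [frames[-1]] * back

def random_crop(frame, size=96, rng=None):
    """Step (c), training: random spatial crop from a resized 128x128 frame."""
    rng = rng or np.random.default_rng(0)
    h, w = frame.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return frame[top:top + size, left:left + size]

def center_crop(frame, size=96):
    """Step (c), testing: deterministic center crop."""
    h, w = frame.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return frame[top:top + size, left:left + size]
```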
Architecture-
see Fig 1, Table 2
Encoder-decoder architecture
Trained end-to-end
Encoder - 2 channels of CNN (2D and 3D) + bi-LSTM RNN for video encoding
Shared features for classification and captioning; the loss is a weighted combination of the two tasks (page 5, Equation 2). Best lambda = 0.1.
Decoder for captioning is a 2-layer LSTM; for classification, a 2-layer MLP
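The joint training objective described above can be sketched as follows (a sketch only, assuming the lambda weight multiplies the captioning term; check Equation 2 on page 5 of the paper for the exact form):

```python
def joint_loss(cls_loss, cap_loss, lam=0.1):
    """Weighted combination of the classification and captioning losses.
    Which term lambda scales is an assumption; the paper's best lambda = 0.1."""
    return cls_loss + lam * cap_loss
```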
Baseline-
VGG16 (ImageNet pre-trained) + 2 more LSTM layers to integrate over 48 frames = 31.7%
Experiments and results-
Different combinations of the numbers of channels in the 2D and 3D CNNs (Table 2) - equal channel counts (256 each) give the best result on the classification task = 50.5%
Captioning -
Pre-processing - (a) converted fine-grained captions (e.g. "blue cap", "men's short sleeve shirt") to simplified captions (e.g. "cap", "shirt"); (b) replaced words occurring < 5 times with [something]
Baseline - replace all occurrences of [something] with the most likely object string, conditioned on that action class, from the classifier trained with simplified captions. Exact-match accuracy = 5.7%
Results (Table 5) - Exact-match accuracy = 8.63%; classification improved slightly to 51.3%
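The rare-word replacement step (b) can be sketched as below (a minimal sketch; whitespace tokenization and corpus-level counting are assumptions about how the paper counts occurrences):

```python
from collections import Counter

def simplify_captions(captions, min_count=5, placeholder="[something]"):
    """Replace tokens occurring fewer than `min_count` times across the
    whole corpus with a placeholder token (sketch of step (b))."""
    counts = Counter(tok for cap in captions for tok in cap.split())
    return [" ".join(tok if counts[tok] >= min_count else placeholder
                     for tok in cap.split())
            for cap in captions]
```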
Transfer learning -
On a small dataset, "kitchenware", with 13 labels and 390 videos. Procedure - freeze each pre-trained model, then train a classifier (logistic regression, MLP, LSTM) on top. The pre-trained models produce 12 feature vectors per second of video. Results (Fig 4) - 10-shot training on kitchenware + pre-training on fine-grained captions and fine-grained action classes shows the best result = 63.5%
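For the frozen-feature probing described above, the per-second feature vectors need to be reduced to a fixed-size input before a logistic regression or MLP head (the LSTM head consumes the sequence directly). A minimal sketch, where mean pooling over time is my assumption about the aggregation step:

```python
import numpy as np

def pool_features(per_second_vectors):
    """The frozen encoder emits 12 feature vectors per second of video;
    mean-pool over time to get one fixed-size vector for a simple
    classifier head. Pooling choice is an assumption."""
    feats = np.stack(per_second_vectors)  # shape (T, D)
    return feats.mean(axis=0)             # shape (D,)
```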
Read "Fine-grained Video Classification and Captioning" https://arxiv.org/pdf/1804.09235.pdf and mark relevant points in paper (1) to discuss with team.
Choose another paper that attempts video classification from among the ones listed in the "To do" column (all papers that don't have 'Predictive' in their title :P)
Convert the card into an issue and assign it to yourself.
Write a critical summary of the paper as a comment for the other two to read.