Performing video classification using the predictive processing architecture. The model is trained in a self-supervised manner to predict future frames in videos, alongside the supervised video action classification task.
Build a benchmark NN architecture for the 20BN Something-Something-v2 dataset.
Task: Fine-grained action classification and video captioning.
Test: Transfer-learning ability on a kitchen-action dataset, improved thanks to the fine-grained labels.
Dataset -
220,000 videos
174 fine-grained actions
frame rate - 12fps
Pre-processing - (a) if the number of frames < 48, replicate the first and last frames; (b) resize the frames to 128×128; (c) random cropping of size 96×96 (for testing, use a center crop instead)
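The three pre-processing steps can be sketched as below (a minimal numpy sketch; how the replicated frames are split between the front and back of a short clip is an assumption, since the notes only say "replicate the first and last frames"):

```python
import numpy as np

def pad_frames(frames, target=48):
    """Step (a): replicate the first and last frames so a short clip
    reaches `target` frames. The even front/back split is an assumption."""
    n = len(frames)
    if n >= target:
        return list(frames[:target])
    deficit = target - n
    front = deficit // 2
    back = deficit - front
    return [frames[0]] * front + list(frames) + [frames[-1]] * back

def random_crop(frame, size=96, rng=None):
    """Step (c), training: random spatial crop from a resized 128x128 frame."""
    rng = rng or np.random.default_rng(0)
    h, w = frame.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return frame[top:top + size, left:left + size]

def center_crop(frame, size=96):
    """Step (c), testing: deterministic center crop."""
    h, w = frame.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return frame[top:top + size, left:left + size]
```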
Architecture-
see Fig 1, Table 2
Encoder-decoder architecture
Trained end-to-end
Encoder - 2 channels of CNN (2D and 3D) + bi-LSTM RNN for video encoding
Shared features for classification and captioning; the loss is a weighted combination of the two tasks (page 5, Equation 2). Best lambda = 0.1.
Decoder for captioning is a 2-layer LSTM; for classification, a 2-layer MLP
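The joint training objective described above can be sketched as follows (a sketch only, assuming the lambda weight multiplies the captioning term; check Equation 2 on page 5 of the paper for the exact form):

```python
def joint_loss(cls_loss, cap_loss, lam=0.1):
    """Weighted combination of the classification and captioning losses.
    Which term lambda scales is an assumption; the paper's best lambda = 0.1."""
    return cls_loss + lam * cap_loss
```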
Baseline-
VGG16 (ImageNet pre-trained) + 2 more LSTM layers to integrate over 48 frames = 31.7%
Experiments and results-
Different combinations of the numbers of channels in the 2D and 3D CNNs (Table 2) - equal channel counts (256 each) give the best result on the classification task = 50.5%
Captioning -
Pre-processing - (a) converted fine-grained captions (e.g. "blue cap", "men's short sleeve shirt") to simplified captions (e.g. "cap", "shirt"); (b) replaced words occurring < 5 times with [something]
Baseline - replace all occurrences of [something] with the most likely object string, conditioned on that action class, from the classifier trained with simplified captions. Exact-match accuracy = 5.7%
Results (Table 5) - Exact-match accuracy = 8.63%; classification improved slightly to 51.3%
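The rare-word replacement step (b) can be sketched as below (a minimal sketch; whitespace tokenization and corpus-level counting are assumptions about how the paper counts occurrences):

```python
from collections import Counter

def simplify_captions(captions, min_count=5, placeholder="[something]"):
    """Replace tokens occurring fewer than `min_count` times across the
    whole corpus with a placeholder token (sketch of step (b))."""
    counts = Counter(tok for cap in captions for tok in cap.split())
    return [" ".join(tok if counts[tok] >= min_count else placeholder
                     for tok in cap.split())
            for cap in captions]
```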
Transfer learning -
On a small dataset, "kitchenware", with 13 labels and 390 videos. Procedure - freeze each pre-trained model, then train a classifier (logistic regression, MLP, LSTM) on top. The pre-trained models produce 12 feature vectors per second of video. Results (Fig 4) - 10-shot training on kitchenware + pre-training on fine-grained captions and fine-grained action classes shows the best result = 63.5%
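For the frozen-feature probing described above, the per-second feature vectors need to be reduced to a fixed-size input before a logistic regression or MLP head (the LSTM head consumes the sequence directly). A minimal sketch, where mean pooling over time is my assumption about the aggregation step:

```python
import numpy as np

def pool_features(per_second_vectors):
    """The frozen encoder emits 12 feature vectors per second of video;
    mean-pool over time to get one fixed-size vector for a simple
    classifier head. Pooling choice is an assumption."""
    feats = np.stack(per_second_vectors)  # shape (T, D)
    return feats.mean(axis=0)             # shape (D,)
```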
Read "Fine-grained Video Classification and Captioning" https://arxiv.org/pdf/1804.09235.pdf and mark relevant points in paper (1) to discuss with team.
Choose another paper that attempts video classification from among the ones listed in the "To do" column (all papers that don't have 'Predictive' in their title :P)
Convert the card into an issue and assign it to yourself.
Write a critical summary of the paper as a comment for the other two to read.