
prednet-and-predictive-coding-a-critical-review's Issues

Datasets

Research the following datasets:

  1. something-something-v2
  2. UCF-101
  3. HMDB-51
  4. Sports-1M
  5. YouTube-8M
  6. KITTI dataset

Things to find out for all the datasets:

  1. Total number of videos
  2. fps and the resolution of each video's frames
  3. Check out a few videos and list anything peculiar (like camera changes within a video).

Read and summarize "Deep multi-scale video prediction beyond mean square error"

Key points :

  1. Facebook paper from LeCun's group (reliable)
  2. Points out the problems with L2 loss for video prediction
  3. Tries different loss functions and their combinations - (a) L1 (b) L2 (c) gradient difference loss (GDL) (d) adversarial loss
  4. Tries different accuracy metrics - PSNR and SSIM
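A minimal NumPy sketch of PSNR and the gradient difference loss mentioned above (my own illustration, not the paper's code; `pred` and `target` are assumed to be float arrays in [0, 1]):

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def gdl(pred, target, alpha=1):
    """Gradient difference loss: penalize mismatch between the image
    gradients of prediction and target, which discourages blurry output."""
    def grads(x):
        # horizontal and vertical finite differences
        return np.abs(np.diff(x, axis=-1)), np.abs(np.diff(x, axis=-2))
    (gx_p, gy_p), (gx_t, gy_t) = grads(pred), grads(target)
    return (np.abs(gx_p - gx_t) ** alpha).sum() + (np.abs(gy_p - gy_t) ** alpha).sum()
```

Note that GDL is zero for any constant-offset error (gradients match), which is exactly the kind of error plain L2 penalizes most; combining them covers both failure modes.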

Results :
https://cs.nyu.edu/~mathieu/iclr2016.html

  1. Adversarial + GDL loss gives the best results visually (my opinion) and the best PSNR scores
  2. Models with optical flow give the best SSIM scores.

20bn paper : Fine-grained Video Classification and Captioning

Problem statement -

Build a benchmark NN architecture for the 20bn something-something-v2 dataset.
Task : fine-grained action classification and video captioning.
Test : whether transfer-learning ability on a kitchen-action dataset improves because of the fine-grained labels.

Dataset -

  1. 220,000 videos
  2. 174 fine-grained actions
  3. frame rate - 12fps
  4. Pre-processing - (a) if the number of frames < 48, replicate the first and last frames (b) resize the frames to 128×128 (c) random cropping of size 96×96; for testing, use a center crop instead
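A rough sketch of the padding and cropping steps above (my own illustration; the even front/back split of replicated frames is an assumption, and the 128×128 resize is omitted). `frames` is assumed to be a (time, height, width, channels) array:

```python
import numpy as np

def pad_frames(frames, min_len=48):
    """If a clip has fewer than min_len frames, replicate the first and
    last frames (assumed: padding split evenly between front and back)."""
    t = len(frames)
    if t >= min_len:
        return frames
    deficit = min_len - t
    front = deficit // 2
    back = deficit - front
    return np.concatenate([np.repeat(frames[:1], front, axis=0),
                           frames,
                           np.repeat(frames[-1:], back, axis=0)])

def center_crop(frames, size=96):
    """Deterministic center crop, used at test time."""
    _, h, w = frames.shape[:3]
    top, left = (h - size) // 2, (w - size) // 2
    return frames[:, top:top + size, left:left + size]

def random_crop(frames, size=96, rng=np.random.default_rng()):
    """Random spatial crop, used as training-time augmentation."""
    _, h, w = frames.shape[:3]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return frames[:, top:top + size, left:left + size]
```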

Architecture-

see Fig 1, Table 2

  1. Encoder-decoder architecture
  2. Trained end-to-end
  3. Encoder - 2 CNN channels (2D and 3D) + a bi-LSTM RNN for video encoding
  4. Shared features for classification and captioning; the loss is a weighted combination of the 2 tasks (page 5, equation 2). Best lambda = 0.1.
  5. Decoder for captioning is a 2-layer LSTM; for classification, a 2-layer MLP
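The weighted two-task loss (point 4, best lambda = 0.1) can be sketched as follows; a hedged NumPy illustration with cross-entropy written out by hand, not the paper's code:

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example."""
    z = logits - logits.max()              # subtract max for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def joint_loss(cls_logits, cls_label, cap_logits, cap_tokens, lam=0.1):
    """Classification loss plus lam * captioning loss, where the
    captioning loss is token-level cross-entropy averaged over the caption."""
    cap_loss = np.mean([cross_entropy(l, t)
                        for l, t in zip(cap_logits, cap_tokens)])
    return cross_entropy(cls_logits, cls_label) + lam * cap_loss
```

With lam = 0.1 the classification term dominates, which matches the paper's framing of captioning as an auxiliary task that slightly improves classification.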

Baseline-

  1. VGG16 ImageNet pre-trained + 2 more LSTM layers to integrate over 48 frames = 31.7%

Experiments and results-

  1. Different combinations of the number of channels in the 2D CNN and 3D CNN (Table 2) - both equal, with 256 channels each, gives the best result on the classification task = 50.5%

  2. 50 labels (action groups) vs. 174 fine-grained action labels (Table 3) = fine-grained gives better accuracy by 5%

  3. Captioning -
    Preprocessing - (a) converted fine-grained captions (e.g. blue cap, men's short-sleeve shirt) to simplified captions (e.g. cap, shirt) (b) replaced words occurring < 5 times with [something]
    Baseline - replace all occurrences of [something] with the most likely object string conditioned on that action class, using the classifier trained with simplified captions. Exact-match accuracy = 5.7%.
    Results (Table 5) - exact-match accuracy = 8.63%; classification improved slightly to 51.3%

  4. Transfer learning -
    On a small dataset, "kitchenware", with 13 labels and 390 videos.
    Procedure - freeze each pre-trained model, then train a classifier (logistic regression, MLP, LSTM) on top. The pre-trained model produces 12 vectors per second of video.
    Results (Fig 4) - 10-shot training on kitchenware + pre-training on fine-grained captions and fine-grained action classes shows the best result = 63.5%
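The captioning baseline above (fill each [something] with the most likely object for the predicted action class) can be sketched like this; a hedged illustration with made-up counts, not the paper's code:

```python
from collections import Counter, defaultdict

def fit_object_priors(action_object_pairs):
    """Count which simplified object string co-occurs most often
    with each action class in the training captions."""
    counts = defaultdict(Counter)
    for action, obj in action_object_pairs:
        counts[action][obj] += 1
    return {a: c.most_common(1)[0][0] for a, c in counts.items()}

def fill_caption(template, action, priors):
    """Replace every [something] with the top object for that action."""
    return template.replace("[something]", priors[action])
```

This baseline ignores the video content entirely beyond the predicted action class, which is why beating its 5.7% exact-match score with the learned captioner is a meaningful (if modest) result.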

Video-classification-literatures

  1. Read "Fine-grained Video Classification and Captioning" https://arxiv.org/pdf/1804.09235.pdf and mark relevant points in the paper (1) to discuss with the team.
  2. Choose another paper that attempts video classification from among the ones listed in the "To do" column (all papers that don't have 'Predictive' in their title :P)
  3. Convert the card into an issue and assign it to yourself
  4. Write a critical summary of the paper as a comment for the other 2 to read.

Read and summarize - Beyond Short Snippets: Deep Networks for Video Classification

This paper can be considered an aggregate of Karpathy et al., Zisserman et al. and related papers:

  1. For temporal fusion, tries convolutional fusion (i.e. the 3 techniques of Karpathy et al.)
  2. For temporal fusion, also tries LSTMs
  3. Tries using optical flow along with raw frames
  4. Tries both AlexNet and GoogLeNet
  5. Works on the Sports-1M and UCF-101 datasets
  6. They use just 1 frame per second and train on very long videos instead

Results :

  1. LSTMs generally do better than convolutional feature fusion
  2. Get SOTA results on both datasets when they train on the whole long video (120 seconds)
  3. Using optical flow gave SOTA results on UCF-101, but on Sports-1M it did so only when combined with LSTMs
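The 1 frame-per-second sampling mentioned above can be sketched as follows (my own illustration; the `native_fps` parameter and the rounding choice are assumptions):

```python
def sample_frame_indices(n_frames, native_fps, target_fps=1.0):
    """Indices of the frames to keep when subsampling a video from
    native_fps down to target_fps (e.g. 1 frame per second)."""
    step = max(1, round(native_fps / target_fps))
    return list(range(0, n_frames, step))
```

For a 120-second video at 30 fps (3600 frames), this yields 120 frames, i.e. the whole-video setting the results above refer to.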

PredNet critique:

List the specific reasons (backed by visualization results) why PredNet fails on action datasets.
