Value for dropout?

The provided TensorFlow model specification for the i3d network includes dropout right before the last Unit3d layer. May I ask which value has been used for this dropout during the training on Kinetics and during the fine-tuning experiments?

I couldn't find it in the paper (but maybe it's there and I am just blind).

train from scratch on ucf101 dataset

We tried to train the I3D model on UCF101 from scratch, but it converges much more slowly, with a final validation accuracy of around 60%. Can you offer some suggestions for training the I3D model without the ImageNet-pretrained model?

IDT Handcrafted Features

Hi, thank you for sharing this great work.

Would you give more details about how IDT is used to improve the results? Which library did you use to compute it? How is the merge with the pseudo features done?

How to test using the whole videos?

According to your paper, when testing, I should send the whole video to the architecture.
When training, the network produces a tensor of size B x 7 x 1 x 1 x 400, and we average along the temporal dimension and squeeze to get a probability tensor of size B x 400.

When testing, do we simply send the whole video to the network and average over the temporal dimension?

I hope you can tell me the correct method. Thanks for your work.
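The averaging described in this question can be sketched in NumPy as follows. This is only an illustration: it assumes the network output has shape B x T x 1 x 1 x 400, where T grows with the number of input frames, and it uses random stand-in data rather than real model output. The helper name `average_over_time` is hypothetical, and whether the authors average logits or probabilities at test time is not confirmed in this thread.

```python
import numpy as np

def average_over_time(outputs):
    """Collapse (B, T, 1, 1, C) per-timestep outputs into (B, C)."""
    # Drop the two singleton spatial axes: (B, T, 1, 1, C) -> (B, T, C).
    squeezed = np.squeeze(outputs, axis=(2, 3))
    # Average along the temporal axis: (B, T, C) -> (B, C).
    return squeezed.mean(axis=1)

# Stand-in data: a short training clip (T=7) and a longer test video (T=31).
rng = np.random.default_rng(0)
train_out = rng.standard_normal((2, 7, 1, 1, 400))
test_out = rng.standard_normal((2, 31, 1, 1, 400))

train_avg = average_over_time(train_out)
test_avg = average_over_time(test_out)
print(train_avg.shape, test_avg.shape)
```

The same function handles both cases because only the temporal length T changes between a fixed-length training clip and a whole test video.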

Feature Extraction from Last Global Average Pooling Layer

I am trying to extract features from the last global average pooling layer, but the final tensor after

net = tf.nn.avg_pool3d(net, ksize=[1, 2, 7, 7, 1],
                             strides=[1, 1, 1, 1, 1], padding=snt.VALID)

is of size (1, 6, 1, 1, 1024). Is there a meaning in that? Am I doing something wrong? I was hoping for a single feature vector of size 1x1024, not 6 of them.
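One possible reading of the shape above: with VALID padding, a pooling window of 2 and a stride of 1 along time, 7 temporal steps going into the pool would leave 6 coming out, while the 7x7 spatial window collapses a 7x7 map to 1x1. The sketch below checks that arithmetic and shows one common way (an assumption here, not the authors' confirmed method) to reduce the result to a single 1024-d vector by averaging over time. The helper name and the stand-in data are hypothetical.

```python
import numpy as np

def valid_pool_length(input_len, kernel, stride):
    """Output length of a VALID (no padding) pooling window along one axis."""
    return (input_len - kernel) // stride + 1

# Assumption: 7 temporal steps enter the pooling layer (window 2, stride 1).
temporal_steps = valid_pool_length(7, kernel=2, stride=1)
# Spatially, a 7x7 map pooled with a 7x7 window leaves a single position.
spatial_steps = valid_pool_length(7, kernel=7, stride=1)
print(temporal_steps, spatial_steps)

# Stand-in for the pooled tensor of shape (1, 6, 1, 1, 1024).
rng = np.random.default_rng(0)
features = rng.standard_normal((1, temporal_steps, 1, 1, 1024))

# Average over time, then flatten the singleton axes: one vector per video.
video_feature = features.mean(axis=1).reshape(features.shape[0], -1)
print(video_feature.shape)
```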

train on my own dataset

Hello! I trained the I3D model on my own dataset: 2 classes, each with about 50 videos. The two classes are similar, like open the door / close the door. After 40 epochs, the training accuracy is 90+%, but the validation accuracy is just 50%, so the model didn't learn anything useful! What should I do?

about batch size


In the paper "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset", it is said that you used 64 GPUs to train the 3D ConvNets.
But in the README, you said the minibatch size is 6.
Does it mean that during multi-GPU training, the total batch size is 6*64 = 384?



Hello,

RGB or BGR ?

I wonder whether the image color order you used is RGB or BGR?
OpenCV and PIL in Python use different color formats, and I'm not sure how much difference this will cause.
Any help will be appreciated.
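For reference, OpenCV's image readers return channels in BGR order while PIL returns RGB, so converting between the two is a reversal of the last axis. The sketch below shows that conversion on a stand-in array; it does not settle which order the released checkpoints expect, and the helper name `bgr_to_rgb` is hypothetical.

```python
import numpy as np

def bgr_to_rgb(frame):
    """Reverse the channel axis of an (H, W, 3) image: BGR <-> RGB."""
    return frame[..., ::-1]

# Stand-in 2x2 image whose pixels are pure blue when read in BGR order.
bgr = np.zeros((2, 2, 3), dtype=np.uint8)
bgr[..., 0] = 255  # channel 0 is blue in BGR

rgb = bgr_to_rgb(bgr)
print(rgb[0, 0])  # blue now sits in the last channel
```

Applying the same function twice returns the original array, so it can be used in either direction.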

Shape of *.npy file?

In the sample code, the example video has been preprocessed, with RGB and Flow NumPy arrays provided.
I want to test my own video, so I think one way is to generate my own NumPy arrays and replace the example ones.
For RGB, the provided *.npy file has shape (1, num_frames, 224, 224, 3). It seems that 'num_frames' means the number of frames, '224, 224' means height and width, and '3' means the channels (RGB). I'm confused about the '1': what does it mean, and what value should it have?
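A small sketch of building an array with that layout, assuming the leading 1 is a batch dimension (one video per file). The frame data is a zero-filled stand-in and `num_frames = 16` is an arbitrary choice for illustration.

```python
import numpy as np

# Stand-in for decoded, cropped and rescaled frames: (num_frames, 224, 224, 3).
num_frames = 16
frames = np.zeros((num_frames, 224, 224, 3), dtype=np.float32)

# Add a leading axis of size 1 so the array reads (1, num_frames, 224, 224, 3).
clip = frames[np.newaxis, ...]
print(clip.shape)
```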

By the way, what's the equation for the norm of the logits tensor?
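If the reported norm is the usual Euclidean (L2) norm, which is what `np.linalg.norm` computes with default arguments, it is the square root of the sum of the squared logits. That the sample code uses this particular norm is an assumption here; the values below are stand-ins, not real model output.

```python
import numpy as np

logits = np.array([[3.0, -4.0, 12.0]])  # stand-in values

# L2 norm: square root of the sum of squared entries.
manual_norm = np.sqrt(np.sum(logits ** 2))
# With default arguments, np.linalg.norm gives the 2-norm of the flattened array.
library_norm = np.linalg.norm(logits)
print(manual_norm, library_norm)
```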

Is it better to train from scratch on Kinetics-600?

I wonder why you only released a Kinetics-600 checkpoint trained from scratch, and not one initialized from ImageNet pre-trained parameters.
In the paper, performance is better with ImageNet pre-training on Kinetics-400 dataset.
Is it better to train from scratch on Kinetics-600 dataset?


What regularizers do you use when training?

I noticed that in Conv3D or BatchNorm modules, the default regularizers are None.
Do you use regularizers in Conv3d or BatchNorm when training? If so, do you use L1 or L2 regularizers?

depth data

I'm only allowed to use the depth data and not the RGB (due to privacy issues). Could you please tell me if I can still use kinetics-i3d?


automatic sign language recognition.

I'm trying to do sign language recognition in real time, so I'm wondering if this model is the right choice here, and what kind of GPUs are necessary to train such a model.



Hi, I am having difficulties running the code because of library issues. I am currently using Tensorflow-gpu 0.11.0, Tensorflow-probability-gpu 0.4.0 and Sonnet 1.29. Can anybody tell me which combination of Tensorflow-gpu, Tensorflow-probability-gpu and Sonnet versions works for you when running the code? Thanks.

Missing data in Kinetics 400

Thanks very much for the great work.
I found that about 10% of the data is missing from the train set of Kinetics-400.
Is this consistent with your findings, or should I look into improving my download scripts? :)

Image preprocessing

The README says to scale the RGB values between -1 and 1. Does this mean x/128.0-1.0, where x is a uint8 image?
I'm more used to seeing normalizing images with mean and std, so I want to make sure.
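To make the question concrete, the sketch below compares the formula above with a close alternative, x/127.5-1.0, on the extreme uint8 values. Which constant the authors actually used is not stated in this thread, so both are shown only for comparison.

```python
import numpy as np

pixels = np.array([0, 128, 255], dtype=np.uint8)

# The formula from the question: 255 maps slightly below 1.
scaled_128 = pixels / 128.0 - 1.0
# An alternative that maps 0 -> -1 and 255 -> 1 exactly.
scaled_1275 = pixels / 127.5 - 1.0
print(scaled_128)
print(scaled_1275)
```

The two differ by less than 0.01 at every pixel value, so either lands in roughly the same range.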

Finetune the pretrained model on UCF101

Hi,
When I fine-tune the pretrained model on UCF101, I adapt the, only use the `rgb` input, change `_NUM_CLASSES` to 101, add a loss and an optimizer after the logits, and feed the training data and labels to the net, but I encounter these error messages:

2017-09-02 15:28:24.771133: W tensorflow/core/platform/] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-02 15:28:24.771169: W tensorflow/core/platform/] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-02 15:28:24.771190: W tensorflow/core/platform/] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-02 15:28:24.771194: W tensorflow/core/platform/] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-02 15:28:24.771198: W tensorflow/core/platform/] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-09-02 15:28:25.113985: I tensorflow/core/common_runtime/gpu/] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:84:00.0
Total memory: 10.91GiB
Free memory: 2.11GiB
2017-09-02 15:28:25.114035: I tensorflow/core/common_runtime/gpu/] DMA: 0
2017-09-02 15:28:25.114043: I tensorflow/core/common_runtime/gpu/] 0:   Y
2017-09-02 15:28:25.114067: I tensorflow/core/common_runtime/gpu/] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:84:00.0)
INFO:tensorflow:Restoring parameters from data/checkpoints/rgb_scratch/model.ckpt
Traceback (most recent call last):
  File "", line 175, in <module>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "", line 133, in main
    feed_dict={rgb_input:batch_xs, rgb_y: batch_ys})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/", line 895, in run
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/", line 1100, in _run
    % (np_val.shape,, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (8, 101) for Tensor u'Placeholder_1:0', which has shape '(?, 400)'
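The ValueError reports that the label placeholder was built with 400 columns while the fed labels have 101, which suggests the class count used to build the graph was still 400 at that point. The NumPy sketch below only reproduces the shape check; the helpers `one_hot` and `check_feed_shape` are hypothetical names for illustration and are not part of TensorFlow or of the repository.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class ids of shape (B,) into a (B, num_classes) float array."""
    encoded = np.zeros((len(labels), num_classes), dtype=np.float32)
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

def check_feed_shape(value, placeholder_shape):
    """Return True if `value` fits a shape in which None means 'any size'."""
    if value.ndim != len(placeholder_shape):
        return False
    return all(p is None or p == v
               for p, v in zip(placeholder_shape, value.shape))

# A batch of 8 UCF101 labels, one-hot encoded with 101 classes.
batch_ys = one_hot(np.array([0, 5, 100, 42, 7, 7, 99, 1]), num_classes=101)

print(check_feed_shape(batch_ys, (None, 400)))  # the situation in the traceback
print(check_feed_shape(batch_ys, (None, 101)))  # label width matches the graph
```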

Here is my python file:

# Copyright 2017 Google Inc.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Loads a sample video and classifies using a trained Kinetics checkpoint."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf

import i3d
from  dataset import Dataset

batch_size = 8
training_iter = 1000
learning_rate = 0.001


    'rgb': 'data/v_CricketShot_g04_c01_rgb.npy',
    'flow': 'data/v_CricketShot_g04_c01_flow.npy',

    'rgb': 'data/checkpoints/rgb_scratch/model.ckpt',
    'flow': 'data/checkpoints/flow_scratch/model.ckpt',
    'rgb_imagenet': 'data/checkpoints/rgb_imagenet/model.ckpt',
    'flow_imagenet': 'data/checkpoints/flow_imagenet/model.ckpt',

_LABEL_MAP_PATH = 'data/label_map.txt'

FLAGS = tf.flags.FLAGS

tf.flags.DEFINE_string('eval_type', 'rgb', 'rgb, flow, or joint')
tf.flags.DEFINE_boolean('imagenet_pretrained', True, '')

def main(unused_argv):
  eval_type = FLAGS.eval_type
  imagenet_pretrained = FLAGS.imagenet_pretrained

  if eval_type not in ['rgb', 'flow', 'joint']:
    raise ValueError('Bad `eval_type`, must be one of rgb, flow, joint')

  kinetics_classes = [x.strip() for x in open(_LABEL_MAP_PATH)]

  if eval_type in ['rgb', 'joint']:
    # RGB input has 3 channels.
    rgb_input = tf.placeholder(
        shape=(batch_size, 10, _IMAGE_SIZE, _IMAGE_SIZE, 3))
    rgb_y = tf.placeholder(tf.float32, [None, _NUM_CLASSES])
    with tf.variable_scope('RGB'):
      rgb_model = i3d.InceptionI3d(
          _NUM_CLASSES, spatial_squeeze=False, final_endpoint='Logits')
      rgb_logits, _ = rgb_model(
          rgb_input, is_training=True, dropout_keep_prob=1.0)
    rgb_variable_map = {}
    for variable in tf.global_variables():
      if'/')[0] == 'RGB':
        rgb_variable_map[':0', '')] = variable
        print('===variable:', variable)
    rgb_saver = tf.train.Saver(var_list=rgb_variable_map, reshape=True)
#    print('=====variables', rgb_variable_map)

  if eval_type in ['flow', 'joint']:
    # Flow input has only 2 channels.
    flow_input = tf.placeholder(
    with tf.variable_scope('Flow'):
      flow_model = i3d.InceptionI3d(
          _NUM_CLASSES, spatial_squeeze=True, final_endpoint='Logits')
      flow_logits, _ = flow_model(
          flow_input, is_training=False, dropout_keep_prob=1.0)
    flow_variable_map = {}
    for variable in tf.global_variables():
      if'/')[0] == 'Flow':
        flow_variable_map[':0', '')] = variable
    flow_saver = tf.train.Saver(var_list=flow_variable_map, reshape=True)

  if eval_type == 'rgb':
    model_logits = rgb_logits
  elif eval_type == 'flow':
    model_logits = flow_logits
    model_logits = rgb_logits + flow_logits
  model_predictions = tf.nn.softmax(model_logits)
  print( '===model_predictions.shape:', model_predictions.shape)
  model_predictions = tf.reduce_mean(model_predictions, (1,2))
  print( '===model_predictions.shape:', model_predictions.shape)
  loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=model_predictions, labels=rgb_y))
  optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)

  dataset = Dataset('data/rgb_train_split1.txt', 'data/rgb_test_split1.txt')
  config = tf.ConfigProto()
  config.gpu_options.allow_growth = True
  with tf.Session(config=config) as sess:
    step = 1
    while step < training_iter:
      batch_xs, batch_ys = dataset.next_batch(batch_size, 'train')
      rgb_saver.restore(sess, _CHECKPOINT_PATHS['rgb'])
        feed_dict={rgb_input:batch_xs, rgb_y: batch_ys})

if __name__ == '__main__':

Generating Optical Flow


How can I generate optical flow using GPUs? This seems difficult to achieve using Python. Can you help with this?

Thank you

Can we use LiteFlowNet instead of the OpenCV TV-L1 optical flow algorithm?

I have read about other people's problematic experiences with OpenCV's TV-L1 optical flow algorithm and how time consuming it is, and I have also witnessed it first-hand.
My question is simple: is it legitimate to choose another optical flow estimation method, for instance LiteFlowNet (which is currently the state of the art at CVPR 2018)? Will this affect the results, especially when I intend to use the Flow Kinetics-i3d model solely for feature extraction purposes?
Thanks in advance.


Hello @derpson

I'm doing research on action recognition.
I just downloaded the UCF101 dataset for action recognition,
but I need the metadata, JSON files and description files.
If anyone can help, please forward the files.

Can anyone please help me with data augmentation for this dataset?

How can I apply the first N layers of the model to a video file?

Hi all,

I want to use the pre-trained model to process several video files, but I don't want to classify them. I only want to extract the properties of the first N layers (2-3 layers) to see if there are differences between the different video files (they are very similar).

After the prediction function, how can I extract the different outputs of the first layers?

Thank you in advance.

The difference between the 3D Inception Module and the 2D.

Inception_v1 has 3x3 and 5x5 convolution layers, and bn_inception has a 3x3 conv layer and two 3x3 conv layers in the middle branches of each inception module. But in the 3D inception module, you keep only two 3x3 conv layers, one for each branch. Can you tell me why you do this? And when you transfer the 2D bn-inception parameters to the 3D model, do you just ignore the second 3x3 conv layer in the second branch?

Regarding TV-L1 Optical Flow


Due to the nature of the TV-L1 optical flow algorithm, it is quite time consuming to run (and I have more than 100k videos to process, which makes it quite frustrating to watch).

  1. Are you aware of any code or methods to speed up the process? (apart from changing the size of the input)
  2. This may be a no-brainer, but how problematic would it be to feed the pretrained model the output of a different dense optical flow algorithm, such as Farneback?

Thank you,

Data-preprocessing for kinetics-400

Hi, I would like to know how to preprocess Kinetics-400 to reproduce the results. I found that extracting TV-L1 flow before rescaling the RGB images leads to worse flow recognition accuracy.
So, currently, I first resample the videos at 25 fps. Then I extract RGB frames and resize them so that the shorter side is 256 pixels. I am using the OpenCV 3.4 version of cv::cuda::OpticalFlowDual_TVL1 for flow extraction on the resized grayscale frames. All pixel values are rescaled as mentioned in the project. Are there any details I am missing in this preprocessing procedure? Or am I extracting optical flow the right way? Thanks.
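The resize step in this pipeline can be sketched as plain arithmetic. The snippet computes the target size for a shorter side of 256 pixels with the aspect ratio preserved, plus a centered 224x224 crop window. The crop size is an assumption taken from the 224x224 shape of the provided arrays, and both helper names are hypothetical.

```python
def resize_dims(height, width, shorter_side=256):
    """Scale (height, width) so the shorter side equals `shorter_side`."""
    scale = shorter_side / min(height, width)
    return round(height * scale), round(width * scale)

def center_crop_box(height, width, crop=224):
    """Return (top, left, bottom, right) of a centered crop x crop window."""
    top = (height - crop) // 2
    left = (width - crop) // 2
    return top, left, top + crop, left + crop

# Example: a 360x640 (height x width) video frame.
new_h, new_w = resize_dims(360, 640)
box = center_crop_box(new_h, new_w)
print(new_h, new_w, box)
```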

Regarding the 2 dimensions of the Optical Flow

Hi, I have a question regarding the explanation of the optical flow used. The git page states,

We only use the first two output dimensions, and apply the same cropping as for RGB. The provided .npy file thus has shape (1, num_frames, 224, 224, 2)

However, I was wondering what this refers to exactly. Is this the stack of u and v, the output of TV-L1? (If that is the case, in what order?) Or do you turn it into an RGB image and use just the R and G channels?

This was a little unclear for me, thanks.
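Under the first interpretation, the two channels would simply be the horizontal and vertical flow components stacked on the last axis. The sketch below builds an array of shape (1, num_frames, 224, 224, 2) that way. The (u, v) channel order and the truncation bound of 20 before rescaling to [-1, 1] are assumptions made for illustration, not something confirmed in this thread, and `pack_flow` is a hypothetical helper.

```python
import numpy as np

def pack_flow(u, v, bound=20.0):
    """Stack two (T, H, W) flow components into a (1, T, H, W, 2) array in [-1, 1]."""
    flow = np.stack([u, v], axis=-1)      # (T, H, W, 2), assumed order: u then v
    flow = np.clip(flow, -bound, bound)   # truncate large displacements
    flow = flow / bound                   # rescale to [-1, 1]
    return flow[np.newaxis, ...].astype(np.float32)

# Stand-in flow fields for 4 frames of size 224x224.
rng = np.random.default_rng(0)
u = rng.normal(scale=15.0, size=(4, 224, 224))
v = rng.normal(scale=15.0, size=(4, 224, 224))

flow_clip = pack_flow(u, v)
print(flow_clip.shape, flow_clip.min(), flow_clip.max())
```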

