v-iashin / video_features Goto Github PK

Extract video features from raw videos using multiple GPUs. We support RAFT flow frames as well as S3D, I3D, R(2+1)D, VGGish, CLIP, and TIMM models.

Home Page: https://v-iashin.github.io/video_features

License: MIT License

Python 98.17% Dockerfile 1.83%

audio-features clip feature-extraction i3d ig65m laion multi-gpu optical-flow parallel pytorch r2plus1d raft resnet s3d swin timm vggish video-features visual-features vit

video_features's People

Contributors

Stargazers

Watchers

Forkers

ysminabk limitmhw ustcjwyang lzy4021 amirunpri2018 hjxh sssuhaoyu jmgaljaard yhzhai ssmgg ankitshah009 mihaianton onlyonewater fork-for-modify valterlej yukaryote kamino666 engineerkhan yqgao716 patttto lifeifei-airs jhong93 thameemtraining ltp1995 huzeyann alceballosa allezsyh czt117 nums11 tekinonlayn ohjho videosummarizer jasongief zhongyy zoharli vassmorozov wolfworld6 779399462 jeromeyaoh cz26 juntae9926 danialashidiq lavrikov chanivan98 pixlrainbow wumuyu9 itachi-452b m-bain seejiafong jtotoole caspeerrr habakan mazzz56 siddd25 polinli pwzy jackzhousz linxi1158 shivvrat aspnetcs zhipengy nwut ilpoviertola eshaanmandal inoculate23 hamed-ebrahimi-iran bjuncek rezowan-ferdous yuanqinglee chrisindris shujunyy123 emotion-recognition-reproduce py85252876 haoheliu min-kid zhuwenhao vividus-tfg daweiro junxi-chen palimisis karthik111 swat1563 huihojeun2 realcvonge mjahmadee smallwhite999 hitcbw lakersli ishara-ai

video_features's Issues

is tencrop avaliable?

hi! thanks for your repo, which is very useful.
i wondering about torchvision.transforms.TenCrop apply to your code.

i was attempt to change some code in extract_resnet.py

class ExtraResNet in extra_resnet.py

self.transforms = transforms.Compose([
transforms.ToPILImage(),
transforms.Resize(256),
transforms.CenterCrop(224), -> transforms.TenCrop(224),
transforms.ToTensor(), -> transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(crop) for crop in crops])),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) ->
transforms.Lambda(lambda normal: torch.stack([transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])(normalize) for normalize in normal]))
])

but i got error as below:
RuntimeError: Expected 3d (unbatched) or 4d (batched) input to conv2d, but got input of size: [64, 10, 3, 224, 224]

please let me know if you have idea. thanks!

OSError with Anaconda, works on Miniconda

Hi , this is a greate work for my stduy ,But I had a problem running the demo.

(torch_zoo) czy@PT6630W:/data1/czy/affectivevideo/video_features$ /home/czy/anaconda3/envs/torch_zoo/bin/python3.8 main.py feature_type=r21d device_ids="[0]" video_paths="[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]"
Traceback (most recent call last):
  File "main.py", line 6, in <module>
    import torch
  File "/home/czy/anaconda3/envs/torch_zoo/lib/python3.8/site-packages/torch/__init__.py", line 189, in <module>
    _load_global_deps()
  File "/home/czy/anaconda3/envs/torch_zoo/lib/python3.8/site-packages/torch/__init__.py", line 142, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/home/czy/anaconda3/envs/torch_zoo/lib/python3.8/ctypes/__init__.py", line 381, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/czy/anaconda3/envs/torch_zoo/lib/python3.8/site-packages/torch/lib/../../../../libcublas.so.11: symbol free_gemm_select version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference

Is there any way to solve this problem？Looking forward to someone's reply. Thanks

tmp file can't be stored and len() of unsized object

hi Vladimir, thanks for ur awesome and systemantic work !
there are three issues when i try my best to ‘corporate’ with ur project.
specifically, i just want to extract features of the msvd & msr-vtt dataset for video captioning task.
so, the pipeline kept in my mind like followings.
first, the video dataset, msvd or msr-vtt, is cut into rgb frames one-by-one.
then, the flow frames are obtained based on those rgb frames.
next, the rgb & flow frames are sent to the cnn encoder (inceptionresnetv2, i3d) and finally the features can be extracted.
back to issues, in my mind, the rgb & flow frames will be saved at the tmp file according to ur documention.
however, 1) the tmp file can't be seen in my workspace although i set the --keep_tmp_file parameter and run main.py file.
besides, 2) the warning ‘The given Numpy array is not writeable, and ...’ always displayed as long as the main.py file runs.
the warning caused by the code 'img=torch.as_tensor(np.asarray(pic))' in transformers file. i attempt to fix it but failed.
the third problems i met is 3) the notice 'Extraction failed at: ./sample/....mp4 with error. Continuing extraction len() of unsized object' occurs after setting --on_extraction parameter. (ps: i have a single gpu device)
that's all. if you have any great advice on data preprocessing for video captioning task, welcome tell me.
looking forward to your reply !

Tests are too "heavy"

Problem: Currently ./tests folder occupies 122M, which is a bit too much for the purpose. It is mostly due to the size of the ref files.

Idea: The references are saved as tensors. It might be a bit redundant and we could aggregate (mean, the spatial dimension of the optical flow for instance which will significantly reduce the reference size.

Docker image supporting all models

It would be nice to have a docker container with everything installed.

docker
docs

Sampling rate of VGGish

Hi, i am new to VGGish and i want to use it to extract audio features. The website says the feature tensor will be 128-d and correspond to 0.96s of the original video. Can i change this sampling rate to other numbers? I see I3D model provieds some arguments to do this(like stack_size and step_size), can i do this in VGGish?

Add supports on Windows and CPU-only system

Hi, I'm very interested in this repo and I have a project based on it, and I've made some changes to a previous version of this repo(including adding new models, etc.)Kamino666/video_features. Seeing that you have updated this repo, I would like to gradually add my previous changes to it. First I want to support Windows and CPU-only systems.

I understand that currently linux and GPU are generally used for deep learning, but adding support for windows and cpu can help people debug. And I don't think it will cause very big changes.

Windows
Basically, the incompatibility comes from this function, bacause Windows doesn't have which command, which can be replaced with where command.

def which_ffmpeg() -> str:
    '''Determines the path to ffmpeg library

    Returns:
        str -- path to the library
    '''
    result = subprocess.run(['which', 'ffmpeg'], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    ffmpeg_path = result.stdout.decode('utf-8').replace('\n', '')
    return ffmpeg_path

cpu-only
A new item needs to be added to the configuration to specify whether or not to use the GPU, and some minor changes need to be made in main.py. However, in my tests, the PWC model does not work with the CPU, and I am not very clear about this model.

Can I submit a PR to address the above issue?

——
@v-iashin : added backlog:

windows
Cpu

typeerror

(i3d) root@ever:~/video_features# python main.py --feature_type i3d --device_ids 0 2 --video_paths ./sample/v_ZNVhz7ctTq0.mp4 ./sample/v_GGSY1Qvo990.mp4
Traceback (most recent call last):
File "main.py", line 94, in
sanity_check(args)
File "/root/video_features/utils/utils.py", line 105, in sanity_check
assert args.stack_size >= 10, message
TypeError: '>=' not supported between instances of 'NoneType' and 'int'

Warnings: NumPy and OMP: Warning #190

Extraction from paths with the same filename are not supported

Currently, it is assumed that all user-specified files have distinct file names. However, it leads to unwanted behavior if a user extracts files from multiple folders, e.g.

dataset/
  dog/
    000.mp4
  cat/
    000.mp4

On a side note, it would be lovely to keep the folder structure, if any, during the save. Plus, model_name and feature_type should be a part of the path.

<output_path>/feature_type/(model_name/)<path_relative_to(os.common_path(list_of_paths))>.

Feature Extraction Problem in RTX 3090

I am trying to extract feature_type = i3d in RTX 3090. But it freezes the terminal after printing this:

feature_type: i3d
stack_size: 12
step_size: 2
streams: rgb
flow_type: pwc
extraction_fps: null
device: cuda:0
on_extraction: print
output_path: ./output/i3d
tmp_path: ./tmp/i3d
keep_tmp_files: false
show_pred: false
config: null
video_paths: null
file_with_video_paths: ./sample/sample_video_paths.txt

Device: cuda:0

and doesn't do anything after this.

Could you provide a pip requirements

Hello, thank you for your work. I want try to use this code to extract the i3d feature and vggish feature in the BMT branch. But the environment is hard for me to build. Cause some package can not be found in the anaconda environment. So, I want to know whether you can provide a pip requirements list.

A suggestion for speed up the processing

I found that the code will only processing one video on each GPU, and used about 2.5G memory, the memory usage is slow for GPUs which have lager memory. I wonder if there is a way like batch size or multithreading processing, and full use the GPU memory, also save the time to processing the dataset.

Some tests don't work on Google Colab

These are

all runs that involve ffmpeg and (reencode_video_with_diff_fps) extraction_fps. I am not sure if it is my local problem and colab's, or video encoding of the sample video.
ResNet because Google Colab runs the latest torchvision which changed the pre-trained weights for ResNet

commit: 9d3f3da5e242f6836bb2a674cb37c694c0e2e4b2
tests/clip/test_clip.py ..F                                              [ 15%]
tests/i3d/test_i3d.py .F..                                               [ 36%]
tests/r21d/test_r21d.py ...F                                             [ 57%]
tests/raft/test_raft.py FF..                                             [ 78%]
tests/resnet/test_resnet.py FFF                                          [ 94%]
tests/vggish/test_vggish.py F                                            [100%]

Problem of processing short video

Thank you for providing such a good repo. It really helps me a lot.
However, my videos are very short with less than 100 frames. When I tried my videos on your Colab to extract I3D feautures. I got nothing. I guess it's the problem of the setting of step_size, stack_size, extraction_fps. But when I tried to set them all at 1, I still got nothinng.
So, sorry for bothering you with this little problem. I am a fresh guy in coding. So, can you help out?
Many thanks.

The timestamps of some ending frames are 0

There are always a few frames at the end of the output feature with a timestamp of 0.

For example, when using resnet-50 to extract the feature of v_GGSY1Qvo990.mp4 at 2 fps, I got:

array([    0.,   500.,  1000.,  1500.,  2000.,  2500.,  3000.,  3500.,
        4000.,  4500.,  5000.,  5500.,  6000.,  6500.,  7000.,  7500.,
        8000.,  8500.,  9000.,  9500., 10000., 10500., 11000., 11500.,
       12000., 12500., 13000., 13500., 14000., 14500.,     0.,     0.,
           0.,     0.,     0.,     0.])

For another example, if I set args.extraction_fps = 2 in resnet colab example, I got :

[    0.   500.  1000.  1500.  2000.  2500.  3000.  3500.  4000.  4500.
  5000.  5500.  6000.  6500.  7000.  7500.  8000.  8500.  9000.  9500.
 10000. 10500. 11000. 11500. 12000. 12500. 13000. 13500. 14000. 14500.
 15000. 15500. 16000. 16500.     0.     0.]

I think we could calculate the timestamp of the i-th frame by i * (1 / fps), instead of cap.get(cv2.CAP_PROP_POS_MSEC).

A series of video frames

Can I extract features directly from a series of video frames?

How to extract from whole dataset

Hi, thank you for your work.

Now I'm trying to extract Flow feature, from 'dataset' or 'directory' not the exact one 'file'

As I can see through the main.py there is no argument for that.

Would you give me a tip for modify it?

Implementation of SlowFast and TSN

Would you consider adding the ability to extract video features using slowfast model

When the extraction_total parameter is set to 1, the problem of missing frames is encountered

Hi,I encountered a problem when I tried to extract a global feature for whole video. I set extraction_total parameter to 1. But there is a problem When executing the ffmpeg instruction. Is there a way to extract a global feature without video slice?

Add the check for file existence

The existence check
- Could it produce broken files and others will just skip it?
  - smoke-check each file (just loading would do)

Use bigger R(2+1)d pre-trained on IG65M

Hey there, awesome project you have here 🎉

I checked your code and seeing

video_features/models/r21d/extract_r21d.py

Line 9 in 71a9a08

from torchvision.models.video import r2plus1d_18

I was wondering if you have thought about dropping this one in

https://github.com/moabitcoin/ig65m-pytorch

since the r(2+1)d 18 features are - based on experience - not the best to work with, in the r(2+1)d family.

✌️

The output of `show_pred` may be confusing

Neither the documentation nor the program output mentions the meaning of each column, and now we have two outputs with the same format but different meanings.

examples:

Similarity | Probability | Sentences provided

23.061 0.962 a dog smiles
19.824 0.038 a woman is lifting

22.770 0.963 a dog smiles
19.520 0.037 a woman is lifting

24.619 0.929 a dog smiles
22.048 0.071 a woman is lifting

Logit | Probability | Label

33.584 0.594 a photo of snatch weight lifting
33.027 0.340 a photo of deadlifting
31.080 0.049 a photo of clean and jerk
28.425 0.003 a photo of situp
27.873 0.002 a photo of bench pressing

Do you think we should add some output to the code or explain it in the documentation?

"Detect missing frame + Warning: the value is empty for raft" ERROR when extraction_fps=1

Hi, first of all, thank you for sharing your amazing work!

I am trying to extract optical features for Charades-sta dataset with command

python main.py \
    feature_type=raft \
    batch_size=8 \
    extraction_fps=1
    on_extraction=save_numpy \
    device="cuda:4" \
    keep_tmp_files=true \
    video_paths="[/home/jckim/workspace/video_features/tmp/raft/0A8CF_new_fps.mp4]"

(never mind for video_paths the reason for new_fps video's are there is that i've tested if this was problem of re-encoding. neither raw video or re-encoded videos occur same problem)
With the command, I could get 1fps re-encoded video. But when extracting features,

happens like this

and the features extraceted are all blank.

Is there any way to solve this problem?

Thank you in advance!

Nasty LICENSE

Now the project inherits GPL-3.0 License from PWC which prevents adaptation of the library for some use-cases.

Probably will have to drop PWC support to make it MIT 😞

Could you tell me the kinetics's details in accuracy of you project

how to resize the out features ？

thanks for your great code！
In my recently works,i have to mix the features from different network ,but the out features' sizes were not match.
I want to mix the feature from the resnet50 and RAFT(or I3D )
I don't know how to deal with that,could some one help me?😥

New unified video IO

Your recent changes are great and have significantly reduced code redundancy. The code is also easier to read now. Here I have another idea that might make it a little better.

The process of extracting video frames is now defined in base_flow_extractor.py and base_framewise_extractor.py, which use ffmpeg to re-encode the video and opencv to read it. But users may need a new feature of extracting a fixed number of frames for a video, which is harder to implement with the current framework. And, users may want to be able to extract features faster, so could you consider using a more efficient API to read videos (e.g., decord). Moreover, specifying fps to extract features requires re-encoding the video and generating temporary files. Is this necessary? Can we meet the fps by extracting frames at intervals rather than re-encoding it?

The ideal API may looks like:

# when fps = 2
frame_extractor = FrameExtractor("sample.mp4", batchsize=16, fps=2)
# when we want to extract 12 frames
frame_extractor = FrameExtractor("sample.mp4", batchsize=16, total=12)
# when extracting flow 
frame_extractor = FrameExtractor("sample.mp4", batchsize=16, overlap=1)

for batch, time_stamp in frame_extractor:
    batch = transform(batch)  # or as a parameter for the frame extractor
    _run_on_a_batch(batch)

In summary, this new feature can further reduce code redundancy, improve speed and provide new extracting methods.

I would love to make this improvement, but I've been busy lately and probably won't be able to do it until September.

[I3D] Stop loading flow models in RGB-only mode

Importing and loading flow models is redundant when the only I3D stream is RGB.
I am planning to open a PR for this and hopefully #72 if you're interested.

Not Enough Memory Error on RTX 3050Ti

I want to run I3D on RTX 3050Ti 4GB, but it says not enough memory. Is there a solution to this problem. please guide me

Simple tests

Right now one needs to manually run tests (and write them/copy) to make sure the added feature does not cause trouble.

It would be nice to have a script that does it.

I think, two scripts are needed:

for API calls
- maybe making a set of files with correct outputs for every command and the test would compare the files produced with the new code state against the references.
for the import API (as in colab)
- similar to the API: it would load the references and compare the new tensors to these

Smoke tests as well:

shapes

Some extracted audio and video features of the same video have different length!

Thanks for your good project!
I used the same sample strategy to operate audio data and video frames, e.g., resample all video frames using 25 fps, and use 24 frames one time to extract a feature using i3d. At the same time, one audio feature represents a 0.96 audio clip. But I got different length features, e.g, audio with (162, 128) and video with (165, 1024). the video features length is correct but with the wrong audio feature length.
How do I deal with it?

Simplify extraction API to a single GPU

In the current state, the code is a bit ugly and the implementation is strange. I think it can be better.

In particular, currently, it relies on torch DataParallel primitives that are not familiar to most of the users and there should be a better way of doing it.

It looks like regressing it to a single GPU and adding a check for already-processed files will allow a user simply to run multiple processes (one per GPU) for the same input file list. One process will check if another process already did the job that one's is assigned to do. Another benefit is associated with an opportunity to run it on a cluster: queue many jobs with a single GPU and assign the same output folder. Therefore, this simple workaround will allow multi-node multi-GPU scaling. In addition, this will significantly simplify the code.

Simplify the code
Change docs
- advice on multi-node setup
- advice on multi-GPU setup (CUDA_VISIBLE_DEVICE perhaps or just the device id would do)

Support for FiveCrop/TenCrop

Hey, thanks for this repo.

I noticed a lot of architectures use five-cropped or ten-cropped features, so I am wondering if you considered/would consider implementing this feature.
Additionally, I am currently converting the frame rate of the videos before extracting the features, but I think it would be great if we could do both at once.

🐛 some minor bugs

Codes that have no use:
https://github.com/v-iashin/video_features/blob/master/utils/utils.py#L246-L251

Raised unnecessary UserWarning:

UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).

https://github.com/v-iashin/video_features/blob/master/models/clip/extract_clip.py#L184

getting system error while extracting wav file

Error opening './tmp/v_ZNVhz7ctTq0.wav': System error.
Extraction failed at: ./sample/v_ZNVhz7ctTq0.mp4. Continuing extraction

PEP 257: Triple double-quoted strings should be used for docstrings.

PyCharm is mad at me for the triple-quoted string form docstrings. I can convert the docstrings if interested.

PWCNet torch version checking error

if I'm using torch>=1.0.0 then the assert statement on line 21 in models/pwc/pwc_src /pwc_net.py will not work

simple fix, I could submit a PR if you'd like

Optical Flow (`show_pred`): Make a video instead of showing the results in a window

Currently, if show_pred argument is used with a optical flow feature extractor, the window dialog pops-up where a user needs to press some button.

Problem: it is nice but, when a machine does not have an attached screen, it fails.

Idea: we could avoid it by making an mp4 video with RGB and optical flow (or a .gif).

Implementation: maybe with cv2, maybe with imageio.get_writer

RFC: should we keep the old behaviour as an argument? Is it useful for anyone or this is just me enjoying looking at them manually?

ResolvePackageNotFound for conda_env_pwc.yml

GIves this error while trying to do this:

conda env create -f conda_env_pwc.yml

Error:
Retrieving notices: ...working... done
Collecting package metadata (repodata.json): -done
Solving environment: failed

ResolvePackageNotFound:

torchvision==0.4.0=py37_cu100
pytorch==1.2.0=py3.7_cuda10.0.130_cudnn7.6.2_0

CLIP reference and change log is missing

Hey @Kamino666

May I ask you to make a PR adding a header comment to the CLIP files?

e.g.

'''
    Reference: https://github.com/openai/CLIP/tree/<hash-of-the-state-you-used>
    Modified by Kamino (@Kamino666):
      - change
'''
<content starts>

You don't have to add a very elaborate change log, just a sentence describing what has changed compared to the originals.

InvalidArgumentError: Reshape cannot infer the missing input size for an empty tensor unless all specified input sizes are non-zero

Thank you for your great works in advance

I've encountered a problem:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Reshape cannot infer the missing input size for an empty tensor unless all specified input sizes are non-zero
	 [[{{node vggish/Flatten/flatten/Reshape}}]]
	 [[{{node vggish/embedding}}]]

Some video datasets are fine but this problem occurs often :(

Here's my full logs:

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

  0%|                                                     | 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Reshape cannot infer the missing input size for an empty tensor unless all specified input sizes are non-zero
	 [[{{node vggish/Flatten/flatten/Reshape}}]]
	 [[{{node vggish/embedding}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data02/jeiyoon_park/BMT/submodules/video_features/models/vggish/extract_vggish.py", line 60, in forward
    self.extract(idx, sess)
  File "/data02/jeiyoon_park/BMT/submodules/video_features/models/vggish/extract_vggish.py", line 97, in extract
    [vggish_stack] = tf_session.run([embedding_tensor], feed_dict={features_tensor: examples_batch})
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Reshape cannot infer the missing input size for an empty tensor unless all specified input sizes are non-zero
	 [[node vggish/Flatten/flatten/Reshape (defined at /data02/jeiyoon_park/BMT/submodules/video_features/models/vggish/vggish_src/vggish_slim.py:95) ]]
	 [[node vggish/embedding (defined at /data02/jeiyoon_park/BMT/submodules/video_features/models/vggish/vggish_src/vggish_slim.py:99) ]]

Caused by op 'vggish/Flatten/flatten/Reshape', defined at:
  File "/data02/jeiyoon_park/BMT/submodules/video_features/main.py", line 85, in <module>
    parallel_feature_extraction(args)
  File "/data02/jeiyoon_park/BMT/submodules/video_features/main.py", line 34, in parallel_feature_extraction
    torch.nn.parallel.parallel_apply(replicas[:len(inputs)], inputs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 79, in parallel_apply
    _worker(0, modules[0], inputs[0], kwargs_tup[0], devices[0])
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/data02/jeiyoon_park/BMT/submodules/video_features/models/vggish/extract_vggish.py", line 52, in forward
    vggish_slim.define_vggish_slim(training=False)
  File "/data02/jeiyoon_park/BMT/submodules/video_features/models/vggish/vggish_src/vggish_slim.py", line 95, in define_vggish_slim
    net = slim.flatten(net)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1624, in flatten
    outputs = core_layers.flatten(inputs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/layers/core.py", line 333, in flatten
    return layer.apply(inputs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1227, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/layers/base.py", line 530, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 554, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/keras/layers/core.py", line 553, in call
    array_ops.shape(inputs)[0], -1))
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 7179, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Reshape cannot infer the missing input size for an empty tensor unless all specified input sizes are non-zero
	 [[node vggish/Flatten/flatten/Reshape (defined at /data02/jeiyoon_park/BMT/submodules/video_features/models/vggish/vggish_src/vggish_slim.py:95) ]]
	 [[node vggish/embedding (defined at /data02/jeiyoon_park/BMT/submodules/video_features/models/vggish/vggish_src/vggish_slim.py:99) ]]

Reshape cannot infer the missing input size for an empty tensor unless all specified input sizes are non-zero
	 [[node vggish/Flatten/flatten/Reshape (defined at /data02/jeiyoon_park/BMT/submodules/video_features/models/vggish/vggish_src/vggish_slim.py:95) ]]
	 [[node vggish/embedding (defined at /data02/jeiyoon_park/BMT/submodules/video_features/models/vggish/vggish_src/vggish_slim.py:99) ]]

Caused by op 'vggish/Flatten/flatten/Reshape', defined at:
  File "/data02/jeiyoon_park/BMT/submodules/video_features/main.py", line 85, in <module>
    parallel_feature_extraction(args)
  File "/data02/jeiyoon_park/BMT/submodules/video_features/main.py", line 34, in parallel_feature_extraction
    torch.nn.parallel.parallel_apply(replicas[:len(inputs)], inputs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 79, in parallel_apply
    _worker(0, modules[0], inputs[0], kwargs_tup[0], devices[0])
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/data02/jeiyoon_park/BMT/submodules/video_features/models/vggish/extract_vggish.py", line 52, in forward
    vggish_slim.define_vggish_slim(training=False)
  File "/data02/jeiyoon_park/BMT/submodules/video_features/models/vggish/vggish_src/vggish_slim.py", line 95, in define_vggish_slim
    net = slim.flatten(net)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1624, in flatten
    outputs = core_layers.flatten(inputs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/layers/core.py", line 333, in flatten
    return layer.apply(inputs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1227, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/layers/base.py", line 530, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 554, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/keras/layers/core.py", line 553, in call
    array_ops.shape(inputs)[0], -1))
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 7179, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/data02/jeiyoon_park/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Reshape cannot infer the missing input size for an empty tensor unless all specified input sizes are non-zero
	 [[node vggish/Flatten/flatten/Reshape (defined at /data02/jeiyoon_park/BMT/submodules/video_features/models/vggish/vggish_src/vggish_slim.py:95) ]]
	 [[node vggish/embedding (defined at /data02/jeiyoon_park/BMT/submodules/video_features/models/vggish/vggish_src/vggish_slim.py:99) ]]

Extraction failed at: /data02/jeiyoon_park/fairseq/examples/MMPT/video_dataset/SumMe/SumMe/pasted_videos/Bearpark_climbing.mp4. Continuing extraction
100%|█████████████████████████████████████████████| 1/1 [00:04<00:00,  4.68s/it]

Process finished with exit code 0

PWC on RTX 3090: pytorch CUDNN_STATUS_MAPPING_ERROR and CUDA error: invalid texture reference

I have installed your library and use PWC conda env, but it will throw pytorch CUDNN_STATUS_MAPPING_ERROR and CUDA error: invalid texture reference errors, I run this code on RTX 3090 GPU.

Reencoding videos with different FPS fails when the video path contains spaces

The problem lies in the subprocess.call(cmd.split()) call in utils/utils/reencode_video_with_diff_fps. This can be fixed by creating the cmd arguments directly as a list.

How to extract I3D features on different span and stride

TSPNet/
├── i3d-features/
│ ├── span=8_stride=2
│ ├── span=12_stride=2
│ └── span=16_stride=2
├── data-bin/
│ └── phoenix2014T/
│ └── sp25000/
│
├── README.md
├── run-scripts/
└── test-scripts/

I want to extract i3d features on 3 different span (span = 8 _ stride= 2, span = 12_ stride = 2, span = 16 _ stride =2), how can I do this using your i3d feature extraction code. Can you please guide me?

Extracting frame-wise features with I3D

Hey. Is there any way to extract frame a frame features with I3D? I'd like to extract one feature vector per frame. What stack size and step size should I use to do that? Thanks!

Can we get different dimension output i3d features?

Current dimension 1024. I am using I3D features with an action former model, which takes in the 2048 D feature.

read_video error for slightly large videos when extracting S3D features.

I was trying to extract S3D features on a video (~51MB, 11 mins), and was getting an error at the very start of the extraction process, with a console message Killed.

This is occurring because in extract-S3D.py, we're using the read_video from torchvision.io.video to process the video file. I tried to execute only this statement separately and faced the same issue. However, I was able to process a smaller video file (<1MB, ~5 secs) and feature extraction then proceeded without a hitch. Same for the samples provided in the repo. This issue is not present in the I3D feature extraction. Probably because there you use the VideoCapture methods from OpenCV?

I'm trying to see if some other video reader works for this, but I am unsure if read_video applies any transforms before outputting the RGB torch array mentioned in the code. Can you suggest any workaround if this doesn't work?

The torchvision version in my environment is 0.12.0, omegaconf is 2.1.1 as described.

EDIT: I've tried extracting the features for the video I had issue with on the S3D colab notebook, but the kernel crashes there as well.

Add the new feature of being able to easily import this repo as a submodule

For now, we can use this repo to extract features easily and efficiently. Users may first extract features of the videos in the dataset and then use the features as input to train the model. But in the inference phase, it would be nice if we could import this repo as a submodule to extract feature(s) of an individual video and then perform inference.

For example, if I need visual feature and audio feature to perform bi-modal video captioning task, when doing inference, I need to execute the main.py in this repo and then execute the code of my task, which is complex.

What I want to achieve is something like your demo in Colab. The user may clone this repo as a submodule and do from video_features.models.r21d.extract_r21d import ExtractR21D to extract the feature. The major problem now is that the current implementation uses absolute import instead of relative import.

I think this improvement is very beneficial. Maybe it can be released in pypi sometime in the future.

Code is redundant and it is hard to maintain and add new models

Refactor the code base with OOP in mind – now the code is a bit redundant for each feature.