mil-nce_howto100m's People

Contributors

antoine77340, bryant1410, roudimit

mil-nce_howto100m's Issues

Parameters for replicating Zero-Shot evaluation retrieval results

Hello,

Thank you so much for sharing your code and pretrained models. I was trying to replicate your text-video retrieval results on the MSR-VTT dataset. I obtained the pretrained model from here - https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_howto100m.pth. I ran the command mentioned in the README to perform the evaluation, but with a smaller batch size; I didn't change any of the other parameters:

python3 eval_msrvtt.py --batch_size=2  --num_thread_reader=20 --num_windows_test=10 --eval_video_root=path_to_videos --pretrain_cnn_path=path_to_pretrained_model

I get the following results:

R@1: 0.1 - R@5: 0.25 - R@10: 0.35 - Median R: 31.0

These numbers are much lower than the ones mentioned in the table. I am guessing that the evaluation parameters are different since changing the batch size should not affect the results. Could you please tell me what parameter values were used to obtain the results mentioned in the table?
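
For reference, here is a rough sketch of how these zero-shot retrieval metrics are typically computed from a text-video similarity matrix; the function and variable names below are illustrative, not taken from eval_msrvtt.py:

import numpy as np

def retrieval_metrics(sim):
    # sim[i, j] = similarity of caption i to video j; the ground-truth video for caption i is video i
    ranks = np.empty(sim.shape[0])
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])            # videos sorted by decreasing similarity
        ranks[i] = np.where(order == i)[0][0]  # 0-based rank of the correct video
    return {"R@1": np.mean(ranks < 1), "R@5": np.mean(ranks < 5),
            "R@10": np.mean(ranks < 10), "Median R": np.median(ranks) + 1}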

Thank you!

Training with smaller batch sizes

Hi @antoine77340 ,

Thank you very much for publishing this great paper and code.

I'm wondering if you can share some insights regarding the effect of batch size on the results. Did you ever attempt to go below batch size 512? (e.g., 32/64/128?). While trying to reproduce the results, I wasn't able to get very far with batch size 64 (unfortunately, larger batch sizes are out of reach for me w.r.t compute).

Thanks!!
Amir

Log

Hello, thank you for this great work. I am customizing this repo for my own dataset, and I am wondering if you could provide the log file from training. Thank you in advance!

video-caption pair

I noticed that the caption csv file seems to contain many overlapping segments for each clip. I would like long-range video-caption pairs, but I think simply concatenating the segments would be a problem. Is there a way to obtain clean video-caption pairs, or do I have to run ASR on my own?
By the way, thanks for sharing this nice work.
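
In case it helps, here is a rough, hedged sketch of how overlapping ASR segments could be merged into longer spans. The column names ('start', 'end', 'text') are assumptions about the caption csv schema, and overlapping ASR text may still repeat words, which would need separate deduplication:

import pandas as pd

def merge_segments(df, max_gap=1.0):
    # Greedily merge consecutive segments whose time ranges overlap or nearly touch.
    merged, cur = [], None
    for _, row in df.sort_values("start").iterrows():
        if cur is not None and row["start"] <= cur[1] + max_gap:
            cur = (cur[0], max(cur[1], row["end"]), cur[2] + " " + row["text"])
        else:
            if cur is not None:
                merged.append(cur)
            cur = (row["start"], row["end"], row["text"])
    if cur is not None:
        merged.append(cur)
    return pd.DataFrame(merged, columns=["start", "end", "text"])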

about sentence embedding

Hi! Just a quick question about the sentence embedding. In this work the sentence embedding is just two fully connected layers with a max-pooling. I am wondering whether you experimented with more complex sentence encoders, such as BERT? Is the current design meant to save computation, or does it actually work better than more complex models? Thanks!
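
For context, a sentence encoder of that kind can be sketched roughly as follows; the dimensions and names here are illustrative, not the repo's exact implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSentenceEmbedding(nn.Module):
    def __init__(self, vocab_size=20000, word_dim=300, hidden_dim=2048, output_dim=512):
        super().__init__()
        self.word_embd = nn.Embedding(vocab_size, word_dim)
        self.fc1 = nn.Linear(word_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, token_ids):               # token_ids: (batch, n_words)
        x = self.word_embd(token_ids)            # (batch, n_words, word_dim)
        x = F.relu(self.fc1(x))                  # (batch, n_words, hidden_dim)
        x = torch.max(x, dim=1)[0]               # max-pool over the word dimension
        return self.fc2(x)                       # (batch, output_dim)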

Training speed

Thanks for this nice work. Could you provide a rough estimate of the running time for this implementation?

Currently it takes around 2.5 hours to train one epoch, which seems much slower than expected (total batch size 2048, 4 x 8 V100 32GB).
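
As a rough sanity check (assuming an epoch is the 1,238,911 video-text samples mentioned in the README): 1,238,911 / 2048 ≈ 605 optimizer steps per epoch, so 2.5 hours per epoch is roughly 15 seconds per step across the 32 V100s, which may point to video decoding / data loading rather than the forward-backward pass as the bottleneck.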

Thank you!

Question about Sentence_Embedding

Hi,
first of all thank you very much for sharing this!

I have two questions regarding the sentence embedding class.

  1. Why is th.no_grad() used with the embedding even when no pretrained embeddings are used?
  2. Does the model account for the zero padding in the text, e.g. by replacing the embeddings of padded words with zeros before pooling? (A sketch of this idea follows below.)
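
Regarding the second point, here is a minimal illustrative sketch of how padded positions could be masked out before max-pooling (this is not taken from the repo; padding index 0 is an assumption):

import torch

batch, n_words, dim = 2, 6, 4
token_ids = torch.tensor([[5, 8, 2, 0, 0, 0],
                          [7, 3, 9, 4, 1, 0]])   # 0 = padding index (assumed)
x = torch.randn(batch, n_words, dim)              # per-token features before pooling
mask = (token_ids == 0).unsqueeze(-1)             # True at padded positions
pooled = torch.max(x.masked_fill(mask, float("-inf")), dim=1)[0]  # padded tokens never win the max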

Thank you!

Other hyperparams in pre-trained checkpoint (eg. learning rate)

Hi, thanks for this very useful code.

When I try to resume training from the pre-trained S3D_HowTo100M weights, the model quickly outputs all-NaN video and text embeddings after 132 steps with batch size 1024, which is very strange (the same thing happens with different learning rates). I found that the provided checkpoint only contains the network weights, with no other hyperparameters such as the learning rate.

Could you please share the hyperparameters used for pretraining (maybe this is the issue)? It would also be much appreciated if you could shed some light on the bug I am seeing.

PS: I use the provided S3D video features and only keep the very last linear layer of the video encoder trainable.
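
Not from the repo, but one common mitigation worth trying for exploding/NaN losses is gradient clipping; a toy, self-contained illustration (the model, data and learning rate are placeholders):

import torch

model = torch.nn.Linear(512, 512)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss = model(torch.randn(8, 512)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before stepping
optimizer.step()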

Pre-processed video download

Hello,

Can you please explain what you mean by preprocessed videos in "Finally the preprocessed HowTo100M videos (12Tb in total)..."?

How are they preprocessed?

About the MILNCELoss

Hi,

  1. What is the input for the MILNCELoss function?
    Is it like this:
    video_embd: batch x D
    text_embd: batch x D
    where D=512?

  2. Why do you cat the x and x transpose here?
    denominator = th.cat((x, x.permute(1,0,2)), dim=1).view(x.shape[0], -1)
    Doesn't th.logsumexp(x, dim=1) already compute the log-sum in the denominator?

Thanks
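
For reference, here is a minimal self-contained sketch consistent with the denominator line quoted in the second question. Here video_embd is (B, D) and text_embd is (B * n_pair, D), with the n_pair candidate captions of each clip grouped together (an assumption based on the reshape); it is an illustrative reimplementation, not necessarily the repository's exact code:

import torch

def mil_nce_loss(video_embd, text_embd):
    # video_embd: (B, D); text_embd: (B * n_pair, D), rows i*n_pair..(i+1)*n_pair-1 are the candidates for video i
    x = torch.matmul(video_embd, text_embd.t())                  # (B, B * n_pair) similarity scores
    x = x.view(x.shape[0], x.shape[0], -1)                       # (B, B, n_pair)
    # Numerator: log-sum-exp over the n_pair positive candidates of each video (the MIL part).
    pos = (x * torch.eye(x.shape[0], device=x.device)[:, :, None]).sum(dim=1)  # (B, n_pair)
    nominator = torch.logsumexp(pos, dim=1)                      # (B,)
    # Denominator: log-sum-exp over both video-to-text and text-to-video scores for each sample.
    denominator = torch.cat((x, x.permute(1, 0, 2)), dim=1).view(x.shape[0], -1)
    denominator = torch.logsumexp(denominator, dim=1)            # (B,)
    return torch.mean(denominator - nominator)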

Does the model see all the training clips at least once?

Thanks for the good work! As per the README, "An epoch here is equivalent of processing 1238911 video-text training samples, which is the number of different videos in HowTo100M. It is not the same as the number of different training video clips as there are more than 100M clips." Further, the clips are chosen randomly from a long video (here). Is it possible that the model does not look at some clips in the dataset? I know this will not have a significant impact on the performance, but I am just checking my understanding. Thanks!
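
A rough back-of-the-envelope check (assuming on the order of 100 candidate clips per video and one clip drawn uniformly per video per epoch): the probability that a specific clip is never sampled after E epochs is about (1 - 1/100)^E, i.e. roughly 0.90 after 10 epochs and 0.37 after 100 epochs, so many clips are indeed likely never seen in a typical run.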

Minor mistake in MILNCELoss

Hi there,

I'm interested in your paper and the proposed MIL-NCE loss, so I read the code for this loss.
However, I found a mistake in your loss function.

The original design in your paper should be loss = -log(pos / (pos + neg)).
But in the code implementation, you might have mistakenly implemented it as loss = -log(pos / (pos + pos + neg)), which makes the minimum of this loss log 2 ≈ 0.6931. Luckily, this has little impact when the batch size is large.
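
To make the claimed minimum concrete: because the concatenation counts each positive score once in x and once in x.permute(1, 0, 2), in the limit where the positive scores dominate all negatives the loss tends to log(2 * sum_pos e^p) - log(sum_pos e^p) = log 2 ≈ 0.6931.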

You could check the mistake by this code snippet:

import torch
from loss import MILNCELoss  # adjust the import to wherever MILNCELoss is defined in the repo

loss_fn = MILNCELoss()
video = torch.Tensor([[100, 0, 0, 0, 0, 0]]).cuda()
# tensor([[100.,   0.,   0.,   0.,   0.,   0.]], device='cuda:0')
text = torch.Tensor([[100, 0, 0, 0, 0, 0],
                     [100, 0, 0, 0, 0, 0],
                     [100, 0, 0, 0, 0, 0]]).cuda()
# tensor([[100.,   0.,   0.,   0.,   0.,   0.],
#         [100.,   0.,   0.,   0.,   0.,   0.],
#         [100.,   0.,   0.,   0.,   0.,   0.]], device='cuda:0')
loss_fn(video, text)
# tensor(0.6934, device='cuda:0')

hi, about your loss

Hi, dear author,

Thank you for open-sourcing your code! Could you write a small example showing how to use the MIL-NCE loss? I am afraid I might be using it incorrectly.

Best,
jun
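
For what it's worth, a minimal hedged usage sketch, assuming 512-dimensional embeddings and n_pair candidate captions per clip; mil_nce_loss refers to the illustrative function sketched under the "About the MILNCELoss" issue above, and the repo's MILNCELoss class should accept the same shapes:

import torch

B, n_pair, D = 8, 4, 512
video_embd = torch.randn(B, D, requires_grad=True)          # one embedding per video clip
text_embd = torch.randn(B * n_pair, D, requires_grad=True)  # n_pair candidate captions per clip, grouped per video
loss = mil_nce_loss(video_embd, text_embd)                  # illustrative function sketched earlier
loss.backward()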

YouCook and MSR-VTT Dataloaders

Hello,

(Edited)

Thank you for releasing the code. It's massively helpful. I had a few queries regarding the dataloaders:

  1. Where is num_frames used? In args.py, it is flagged as a random seed.
  2. What is the difference between num_frames and num_clips? Why is num_clips set to 10 for eval_msrvtt?
  3. Consider the following code block:
    np.linspace(start, max(start, end-self.num_sec - 0.4), num_clip)
    What is the role of num_sec in this context? (See the sketch after this issue.)

Thanks in advance.
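
For the third point, a small illustrative example of what that line does (the numbers are made up; num_sec is the length in seconds of each sampled window):

import numpy as np

start, end = 0.0, 30.0        # clip span in seconds (made-up values)
num_sec, num_clip = 3.2, 10   # window length and number of evaluation windows
starts = np.linspace(start, max(start, end - num_sec - 0.4), num_clip)
# Each start time yields one window of num_sec seconds; predictions over the num_clip
# windows are then typically aggregated (e.g. averaged) at evaluation time.
print(starts)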

Error with dataloader

I got this error message while running the dataloader with 40 workers to load the HowTo100M dataset. Just wondering if you have ever encountered this; if not, don't worry, it might be a problem with my setup.

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/workspace/pytorch_code/mmssl_graph/train_MILNCE.py", line 207, in main_worker
train(train_loader, model, criterion, optimizer, scheduler, epoch, train_dataset, writer, args)
File "/workspace/pytorch_code/mmssl_graph/train_MILNCE.py", line 225, in train
for i_batch, sample_batch in enumerate(train_loader):
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in next
data = self._next_data()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1065, in _next_data
return self._process_data(data)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
data.reraise()
File "/opt/conda/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
TypeError: __init__() missing 2 required positional arguments: 'stdout' and 'stderr'

Environment

  • PyTorch Version (e.g., 1.0): 1.7.0
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.8
  • CUDA/cuDNN version: 11.1
  • GPU models and configuration: Tesla V100
  • Any other relevant information: Running with Docker container

normalized vector dot product

Hi! I noticed that in this paper you directly multiply the embedding vectors without normalizing them, unlike many recent self-supervised learning papers. Is there a specific reason for not doing the normalization? Thanks!
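
For concreteness, an illustrative comparison of the two options discussed here (raw dot product vs. L2-normalized, i.e. cosine, similarity); this is not code from the repo:

import torch
import torch.nn.functional as F

v = torch.randn(4, 512)   # video embeddings
t = torch.randn(4, 512)   # text embeddings
raw_sim = v @ t.t()                                            # unnormalized dot product, as described in the question
cos_sim = F.normalize(v, dim=1) @ F.normalize(t, dim=1).t()    # what many contrastive methods use instead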
