xuguohai / x-clip


125.0 2.0 15.0 1.6 MB

An official implementation for "X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval"

Home Page: https://arxiv.org/abs/2207.07285

License: MIT License

Python 97.11% Shell 2.89%

multimodal video-text-retrieval msrvtt activitynet didemo lsmdc msvd

x-clip's Issues

The use of transpose in sentence-frame score

X-CLIP/modules/modeling_xclip.py

Line 327 in 6b5344f

    
           sentence_frame_logits = logit_scale * torch.sum(torch.matmul(sentence_output, frame_features.permute(0, 2, 1)) \

Hi, thank you for the wonderful job!

I have a question about the use of .t() in the sentence-frame score. It seems that this transpose changes the original [bs_text, bs_video] layout to [bs_video, bs_text], which makes this score inconsistent with the other scores. Is my understanding correct?

Thanks! Hope to discuss with you!
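
One way to check the layout is to trace the shapes through the quoted matmul. The sketch below is only an illustration: it assumes sentence_output has shape [bs_text, dim] and frame_features has shape [bs_video, n_frames, dim] (neither shape is stated in the issue), it uses NumPy, whose matmul follows the same batch-broadcasting rule as torch.matmul, and it replaces the weighted reduction in the quoted line with a plain sum.

```python
import numpy as np

# Hypothetical sizes, chosen so that every axis has a distinct length.
bs_text, bs_video, n_frames, dim = 3, 5, 4, 8

rng = np.random.default_rng(0)
sentence_output = rng.standard_normal((bs_text, dim))            # assumed [bs_text, dim]
frame_features = rng.standard_normal((bs_video, n_frames, dim))  # assumed [bs_video, n_frames, dim]

# Equivalent of frame_features.permute(0, 2, 1): [bs_video, dim, n_frames].
frames_t = frame_features.transpose(0, 2, 1)

# The 2-D left operand is broadcast over the batch axis of the 3-D right
# operand, so the result is video-major: [bs_video, bs_text, n_frames].
per_frame = np.matmul(sentence_output, frames_t)

# Placeholder reduction over frames (a plain sum, not the weighted sum
# used in the quoted line).
reduced = per_frame.sum(axis=-1)  # [bs_video, bs_text]
sentence_frame = reduced.T        # [bs_text, bs_video]

print(per_frame.shape, reduced.shape, sentence_frame.shape)
# (5, 3, 4) (5, 3) (3, 5)
```

If those shape assumptions hold, the product is already in [bs_video, bs_text] order before any transpose, so a trailing transpose would bring it back to [bs_text, bs_video] rather than away from it. Whether this matches the layout of the other scores in the repository still needs to be confirmed against the actual tensors.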

SeqTransf & meanP

Dear Author,

I really appreciate your work and find it fascinating, and I am thankful that you released your code.

I know that CLIP4Clip + meanP has the best performance among CLIP4Clip + meanP, seqTransf, seqLSTM, and tightTransf.

However, I found that the sh files in your scripts always recommend seqTransf.

Is there any special reason why "sim_header == seqTransf" is the default setting?

I looked at your Table 2 on MSVD, where X-CLIP (ViT-B/32) records an R@1 score of 47.1.
Does this mean that X-CLIP with seqTransf performs better than the other modes (meanP, tightTransf)?
I cannot find which sim_header produced the scores in that table.

If X-CLIP + seqTransf is the recommended setting,
is there any special reason why seqTransf outperforms meanP, unlike in CLIP4Clip?

Sincerely,
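
For readers unfamiliar with the two options: in CLIP4Clip, meanP is the parameter-free head that averages the frame embeddings of a video, while seqTransf first passes the frame embeddings through a temporal transformer and then pools them. The sketch below illustrates only the meanP side, as a masked mean over frames. The function name mean_pool and the toy inputs are hypothetical, and this is not the repository's code.

```python
import numpy as np

def mean_pool(frame_features, video_mask):
    """Masked mean over frames: [bs, n_frames, dim] -> [bs, dim]."""
    mask = video_mask[..., None].astype(frame_features.dtype)  # [bs, n_frames, 1]
    summed = (frame_features * mask).sum(axis=1)               # padded frames contribute 0
    counts = np.maximum(mask.sum(axis=1), 1.0)                 # avoid division by zero
    return summed / counts

# One video with two real frames and one padded frame.
frames = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(frames, mask))  # [[2. 3.]]
```

A seqTransf-style head would insert a learned temporal encoder before this pooling step, so it adds parameters that meanP does not have.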

How to prepare the ActivityNet dataset? The guide from CLIP4Clip is unclear

Question about ablation study of the different contrastive modules in Tab. 6

Hi, thanks for your nice open-source work.

I have some questions about the experimental setup in Tab. 6.

Were all the experimental results obtained through retraining, or were the results for Exp1-14 obtained only by running inference with Exp15's checkpoint?

According to my experiments, there is an obvious drop when only running inference with Exp15's checkpoint. So I guess all the experimental results were obtained through retraining. I hope you can confirm this.
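
The inference-only protocol mentioned above can be sketched as follows: take the score matrices produced by one checkpoint, average only the enabled ones, and compute text-to-video R@1. This is a rough illustration, not the authors' evaluation code. The score names, the equal-weight averaging, the toy matrices, and the assumption that text i matches video i are all hypothetical.

```python
import numpy as np

def combine_scores(scores, enabled):
    """Average the selected [n_text, n_video] score matrices with equal weights."""
    if not enabled:
        raise ValueError("at least one score must be enabled")
    return sum(scores[name] for name in enabled) / len(enabled)

def recall_at_k(sim, k=1):
    """Text-to-video R@k in percent, assuming text i matches video i."""
    true_scores = np.diag(sim)[:, None]
    ranks = (sim > true_scores).sum(axis=1)  # videos scored above the true one
    return float((ranks < k).mean() * 100.0)

# Toy score matrices standing in for two of the contrastive modules.
scores = {
    "video_sentence": np.array([[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.3, 0.6, 0.5]]),
    "frame_word": np.array([[0.7, 0.2, 0.1], [0.1, 0.9, 0.0], [0.2, 0.1, 0.9]]),
}
single = recall_at_k(combine_scores(scores, ["video_sentence"]))
both = recall_at_k(combine_scores(scores, ["video_sentence", "frame_word"]))
print(single, both)
```

Under this protocol the weights were trained with every module active, so dropping modules only at inference time is not the same experiment as retraining with them removed.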

Poor performance when reproducing the model on ActivityNet

Because the original dataset is very large, I extracted frames from the original videos at FPS=1 and trained CLIP4Clip (meanP) on 8 RTX 3090 GPUs. Because of GPU memory constraints, I set gradient_accumulation_steps=2.
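
For reference, extracting frames at 1 FPS could look like the sketch below, which builds an ffmpeg command using its fps video filter. The issue does not say which tool was used, so the choice of ffmpeg, the helper names, and the output file pattern are all assumptions.

```python
import subprocess
from pathlib import Path

def build_ffmpeg_cmd(video_path, out_dir, fps=1):
    """Build an ffmpeg command that writes `fps` frames per second as JPEGs."""
    pattern = str(Path(out_dir) / "%06d.jpg")  # hypothetical naming scheme
    return ["ffmpeg", "-i", str(video_path), "-vf", f"fps={fps}", pattern]

def extract_frames(video_path, out_dir, fps=1):
    """Create the output directory and run ffmpeg (requires ffmpeg on PATH)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_cmd(video_path, out_dir, fps), check=True)

print(build_ffmpeg_cmd("video.mp4", "frames/video"))
```

When comparing against reported numbers, it is worth checking that the sampling rate and the number of frames per video used at training time match the original setup.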

The captions were downloaded from https://cs.stanford.edu/people/ranjaykrishna/densevid/.

I first tried to reproduce the results of CLIP4Clip (meanP, ViT-B/32) on ActivityNet and got R@1 = 37.9, which is much worse than the 40.5 reported in Table 5.

Do the authors have any advice on this issue? Thanks very much!
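
One possible factor, offered as a guess rather than a diagnosis: with a contrastive loss that draws negatives from the current batch, gradient accumulation does not restore the negatives of a larger batch, because each forward pass only sees its own micro-batch. The sketch below does the arithmetic, assuming the global batch is split evenly across accumulation steps and that negatives come only from the same forward pass. The batch size of 128 is a hypothetical value, not one taken from the issue.

```python
def in_batch_negatives(global_batch, accumulation_steps):
    """Negatives per sample when the global batch is split into micro-batches."""
    if accumulation_steps < 1 or global_batch % accumulation_steps != 0:
        raise ValueError("global_batch must be divisible by accumulation_steps")
    micro_batch = global_batch // accumulation_steps
    return micro_batch - 1  # every other sample in the same forward pass

# Hypothetical global batch size of 128.
print(in_batch_negatives(128, 1))  # 127
print(in_batch_negatives(128, 2))  # 63
```

If this applies to the training setup in question, gradient_accumulation_steps=2 roughly halves the number of negatives per sample, which may account for part of the gap.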

Finetuned model weights?

Hello! I was wondering if you have the final model weights after you finished training the model. I know you initialize with the CLIP weights, but it would be super helpful to have the final model weights as well. Thank you!
