tgc1997 / RMN
IJCAI2020: Learning to Discretely Compose Reasoning Module Networks for Video Captioning
Hello! I'd like to ask how the POS tags here are obtained. I see only three types (0, 1, 2); which part-of-speech categories do they represent? Thanks!
Hi,
I see you are using 2D CNN features (1536-dim), 3D CNN features (1024-dim), and RCNN features (2048-dim). I also see something called spatial features with 5 dimensions. What are these features? I could not find them mentioned anywhere in the paper.
File "/RMN-master/models/allennlp_beamsearch.py", line 257, in search
state_tensor.reshape(batch_size, self.beam_size, *last_dims)
RuntimeError: gather_out_cuda(): Expected dtype int64 for index
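For anyone hitting this: `torch.gather` requires its index tensor to be `int64` (`torch.long`). A minimal standalone sketch of the error and the usual fix (casting the index before the gather); the tensors here are illustrative, not the actual beam-search state:

```python
import torch

# Illustrative beam-search-like state: 3 rows, 4 candidates each.
state = torch.arange(12.0).reshape(3, 4)

# An int32 index tensor triggers "Expected dtype int64 for index" in gather,
# so cast it to long first:
idx = torch.tensor([[0, 1], [2, 3], [1, 0]], dtype=torch.int32)
gathered = state.gather(1, idx.long())
print(gathered)  # rows picked per-index: [[0., 1.], [6., 7.], [9., 8.]]
```

The same cast (`index.long()`) applied to the index tensor built in `allennlp_beamsearch.py` should resolve the error, assuming nothing upstream depends on the smaller dtype.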
Hi tgc, I'd like to test this model on my own video. How could I get the extracted features as inputs?
The shape of sfeats in msvd_region_feature.h5 is 1970 x 26 x 36 x 5.
What is the meaning of the last dimension? Thank you!
Hello, may I ask what method you use to extract frame features and region features from videos? Thank you.
Line 128 in 14a9eff
if bsz == opt.train_batch_size:
loss_count /= 10
elif bsz < opt.train_batch_size and i % 10 == 0:
loss_count /= 10
else:
loss_count /= i % 10
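The quoted branch keeps a running average of the loss, printed every 10 batches: a full window is divided by 10, while the final partial window is divided by however many batches actually contributed (`i % 10`). A minimal standalone sketch of that averaging logic:

```python
# Sketch of the windowed loss averaging in train.py: full windows of 10
# are divided by 10; a trailing partial window is divided by its own length,
# mirroring the `loss_count /= i % 10` branch.
def average_window(losses, window=10):
    averages = []
    for start in range(0, len(losses), window):
        chunk = losses[start:start + window]
        averages.append(sum(chunk) / len(chunk))
    return averages

print(average_window([2.0] * 10 + [4.0] * 3))  # [2.0, 4.0]
```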
The project has restarted on my server now. If it still works well after one epoch completes, I will come back and report.
Hi, may I check what the difference is between evaluate.py and train.py? Thank you very much!
(rmn) E:\video_caption\rmn\RMN-master>python evaluate.py --dataset=msvd --model=RMN --result_dir=results/msvd_model --attention=gumbel --use_loc --use_rel --use_func --hidden_size=512 --att_size=512 --test_batch_size=2 --beam_size=2 --eval_metric=CIDEr
335it [01:21, 4.13it/s]
init COCO-EVAL scorer
tokenization...
Traceback (most recent call last):
File "evaluate.py", line 107, in
metrics = evaluate(opt, net, opt.test_range, opt.test_prediction_txt_path, reference)
File "evaluate.py", line 75, in evaluate
scores, sub_category_score = scorer.score(reference, prediction_json, prediction_json.keys())
File "./caption-eval\cocoeval.py", line 64, in score
print('tokenization...')
OSError: [WinError 1] Incorrect function.
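For context: the coco-eval tokenization step shells out to the Java PTBTokenizer via a subprocess, which is what fails here on Windows. One hedged workaround (an approximation, not the exact PTB tokenizer behaviour) is to substitute a pure-Python tokenizer:

```python
import re

# Rough stand-in for the PTBTokenizer used by coco-eval: lowercase the
# caption, split punctuation off from words, and rejoin with spaces.
# This approximates, but does not exactly match, PTB tokenization.
def simple_tokenize(caption):
    caption = caption.lower()
    tokens = re.findall(r"[a-z0-9]+|[^\sa-z0-9]", caption)
    return " ".join(tokens)

print(simple_tokenize("A man is playing guitar."))  # a man is playing guitar .
```

Scores computed this way may differ slightly from the official PTB-tokenized ones; running the evaluation on Linux avoids the issue entirely.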
Hey @tgc1997
Thanks for providing the implementation of such awesome work!!!
I wanted to know how one goes about using the pre-trained models for inference on raw custom videos?
Sir, can you share the link to the code that processes the text features and generates the caption.pkl file?
How to apply the code to my own dataset? Could you please provide the code about feature extraction?
Hi, Ganchao!
I am having difficulty reproducing the experimental results for MSR-VTT.
I have run the project on MSR-VTT several times and always get unsatisfactory results:
the CIDEr scores only fluctuate between 45 and 46.5, which is far from the 49.6 reported in the paper.
Would it be convenient for you to share the random seed values you used in your MSR-VTT experiments?
Training on MSR-VTT is very time-consuming, about 6 days on a single GPU.
Looking forward to your help, thanks!
I obtained this error when running evaluate.py and train.py.
May I know how to solve this issue?
I used 8 GPUs with batch size 32, and training 3 epochs took 11 hours.
How long did it take you to train 20 epochs?
thanks for your work!
Hi Ganchao, here is a bug report.
While debugging and reproducing the project, I found an error that might lead to inaccurate model reproduction or training results:
the att_size=1024 set on the command line does not take effect. The reason is as follows:
although the att_size parameter in the initializers of the SoftAttention and GumbelAttention classes defaults to opt.att_size (=1024), all of the att_size arguments actually passed at the instantiation sites are opt.hidden_size (=512 for the MSVD dataset / =1300 for the MSR-VTT dataset).
related code lines:
Line 18 in 14a9eff
Line 45 in 14a9eff
Line 171 in 14a9eff
Line 175 in 14a9eff
Line 207 in 14a9eff
Line 211 in 14a9eff
Line 245 in 14a9eff
Line 285 in 14a9eff
Line 287 in 14a9eff
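A minimal sketch of the mismatch being reported: the class accepts an att_size, but the call sites pass opt.hidden_size instead of opt.att_size, so the command-line flag is silently ignored. Class and attribute names here are simplified stand-ins, not the repo's exact code:

```python
# Simplified stand-in for SoftAttention/GumbelAttention: the constructor
# takes att_size, so whatever the call site passes wins.
class SoftAttention:
    def __init__(self, feat_size, hidden_size, att_size):
        self.att_size = att_size  # size of the attention projection layer

class Opt:  # mimics the parsed command-line options
    hidden_size = 512
    att_size = 1024

opt = Opt()
# Reported bug: call sites use SoftAttention(2048, opt.hidden_size, opt.hidden_size),
# so --att_size=1024 never reaches the module. The fixed call would be:
attn = SoftAttention(2048, opt.hidden_size, opt.att_size)
print(attn.att_size)  # 1024
```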
Please provide a valid link. Thanks
According to the code, the hidden size is set to 1300 instead of the widely used 1024 or 2048. What is the main consideration behind this choice?
awesome work!
When I reproduce the results you report in this repository (i.e. a CIDEr score of 97.8 on the MSVD dataset), errors indicating a size mismatch for the whole CapModel occur when running evaluate.py with your pretrained file results/msvd_model/msvd_best_cider.pth,
e.g.:
RuntimeError: Error(s) in loading state_dict for CapModel:
size mismatch for encoder.bi_lstm1.weight_ih_l0: copying a parameter with shape torch.Size([2048, 1000]) from checkpoint, the shape in current model is torch.Size([5200, 1000]).
size mismatch ……
size mismatch ……
It seems that you modified the model but did not update msvd_best_cider.pth.
If so, please let me know,
and I would appreciate it if you could provide the new version of the PTH file so that I can reproduce the results you report in this repository.
By the way, why were the final higher results not published in the paper?
Thanks!
When I run sample.py, I get the following error at line 102:
net.load_state_dict(torch.load(opt.model_pth_path))
RuntimeError: Error(s) in loading state_dict for CapModel:
Unexpected key(s) in state_dict: "decoder.module_selection.loc_fc.weight", "decoder.module_selection.loc_fc.bias", "decoder.module_selection.rel_fc.weight", "decoder.module_selection.rel_fc.bias", "decoder.module_selection.func_fc.weight", "decoder.module_selection.func_fc.bias", "decoder.module_selection.module_attn.wh.weight", "decoder.module_selection.module_attn.wh.bias", "decoder.module_selection.module_attn.wv.weight", "decoder.module_selection.module_attn.wv.bias", "decoder.module_selection.module_attn.wa.weight".
Can you help me? Thank you very much!
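One hedged workaround when a checkpoint carries extra keys (here, the module_selection weights) that the current model does not define is to load with `strict=False`, which skips unexpected keys instead of raising. A minimal sketch with a toy module rather than the repo's actual CapModel:

```python
import torch
import torch.nn as nn

# Toy module standing in for CapModel; "extra.weight" simulates one of the
# unexpected decoder.module_selection keys in the checkpoint.
net = nn.Linear(4, 2)
state = net.state_dict()
state["extra.weight"] = torch.zeros(1)

# strict=False tolerates unexpected/missing keys and reports them instead
# of raising a RuntimeError.
result = net.load_state_dict(state, strict=False)
print(result.unexpected_keys)  # ['extra.weight']
```

Note the usual caveat: if the checkpoint and model genuinely diverge (e.g. the model was trained with --use_loc/--use_rel/--use_func enabled), the right fix is to instantiate the model with the matching flags rather than to silently drop weights.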
Which directory is this msr-vtt_model.pth in?
Hi, tgc! I tried using torchvision's fasterrcnn_resnet50_fpn pretrained model to extract the region features of a video, but found that the features I extracted had shape only [823, 4], which is far from the [26, 36, 2048] and [26, 36, 5] in the dataset you provided. What does the extra dimension mean, or what do these three dimensions mean respectively?
I also wonder whether it is feasible to use torchvision's fasterrcnn_resnet50_fpn model to extract features instead of Caffe's Fast R-CNN model. The size of the features extracted with torchvision's fasterrcnn_resnet50_fpn model is far too small. How can I extract more features with the accurate feature dimensions that meet the requirements?
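For what it's worth, a common convention for a 5-dim "spatial feature" accompanying per-region 2048-dim appearance features is the normalized bounding box plus its relative area: [x1/W, y1/H, x2/W, y2/H, area/(W*H)]. Whether this repo uses exactly that encoding is an assumption; a minimal sketch:

```python
# Assumed spatial-feature encoding (not confirmed against the repo):
# normalized box corners plus box area relative to the frame area.
def spatial_feature(box, width, height):
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1)
    return [x1 / width, y1 / height,
            x2 / width, y2 / height,
            area / (width * height)]

# A box covering the top-left quadrant of a 320x240 frame:
print(spatial_feature((0, 0, 160, 120), width=320, height=240))
# [0.0, 0.0, 0.5, 0.5, 0.25]
```

Under this reading, [26, 36, 2048] would be 26 sampled frames x 36 detected regions x 2048-dim RoI features, and [26, 36, 5] the matching per-region spatial vectors; the [823, 4] output from torchvision is just the raw detected boxes, so the 2048-dim RoI-pooled features would have to be extracted separately (e.g. from the box head) and padded/truncated to 36 regions per frame.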