
Grounded Video Description

ActivityNet Entities Object Localization (Grounding) Challenge joins the official ActivityNet Challenge as a guest task this year! See here on how to participate.

This repo hosts the source code for our paper Grounded Video Description. It supports the ActivityNet-Entities dataset. We also have code that supports the Flickr30k-Entities dataset, hosted on the flickr_branch branch.

[Teaser results figure]

Note: [42] indicates Masked Transformer

Quick Start

Preparations

Follow instructions 1 to 3 in the Requirements section to install the required packages.

Download everything

Simply run the following command to download all the data and pre-trained models (216GB in total):

bash tools/download_all.sh

Starter code

Run the following eval code to test whether your environment is set up:

python main.py --batch_size 100 --cuda --num_workers 6 --max_epoch 50 --inference_only \
    --start_from save/anet-sup-0.05-0-0.1-run1 --id anet-sup-0.05-0-0.1-run1 \
    --seq_length 20 --language_eval --eval_obj_grounding --obj_interact

(Optional) Single-GPU training code to double-check your setup:

python main.py --batch_size 20 --cuda --checkpoint_path save/gvd_starter --id gvd_starter --language_eval

You can now skip to the Training and Validation section!

Requirements (Recommended)

  1. Clone the repo recursively:
git clone --recursive git@github.com:facebookresearch/grounded-video-description.git

Make sure the submodules densevid_eval and coco-caption are included.

  2. Install CUDA 9.0 and cuDNN v7.1. Later versions should be fine, but you might need to update the conda env file (e.g., for a newer PyTorch).

  3. Install Miniconda (either Miniconda2 or 3, version 4.6+). We recommend using a conda environment to install the required packages, including Python 3.7 (or 2.7), PyTorch 1.1.0, etc.:

MINICONDA_ROOT=[to your Miniconda root directory]
conda env create -f cfgs/conda_env_gvd_py3.yml --prefix $MINICONDA_ROOT/envs/gvd_pytorch1.1
conda activate gvd_pytorch1.1

Note that there have been some breaking changes since PyTorch 1.2 (e.g., bitwise not on torch.bool/torch.uint8 and masked_fill_). This code base could potentially work with PyTorch 1.2+ if the corresponding changes are made.
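If you do try PyTorch 1.2+, the adjustments are typically along these lines (a minimal, hypothetical sketch, not code from this repo): masks created as torch.uint8 should be cast to torch.bool before being inverted or passed to masked_fill_.

# Minimal sketch of the kind of change needed for PyTorch >= 1.2
# (illustrative only; tensor names here are hypothetical, not from this repo).
import torch

scores = torch.randn(2, 5)
pad_mask = torch.tensor([[1, 1, 1, 0, 0],
                         [1, 1, 0, 0, 0]], dtype=torch.uint8)

# Pre-1.2 code often inverted uint8 masks (e.g., 1 - mask) and passed them
# to masked_fill_. In 1.2+, cast to bool first and use logical not (~):
keep = pad_mask.to(torch.bool)
scores.masked_fill_(~keep, float('-inf'))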

Replace cfgs/conda_env_gvd_py3.yml with cfgs/conda_env_gvd.yml for Python 2.7.

  4. (Optional) If you choose not to use download_all.sh, be sure to install Java and download Stanford CoreNLP for SPICE (see here). Also, download and place the reference file under coco-caption/annotations. Download Stanford CoreNLP 3.9.1 for grounding evaluation and place the uncompressed folder under the tools directory.

Data Preparation

Updates on 04/15/2020: Feature files for the hidden test set, used in the ANet-Entities Object Localization Challenge 2020, are available to download (region features and frame-wise features). Make sure you move the additional *.npy files into your fc6_feat_100rois and rgb_motion_1d folders, respectively (see the sketch below). The following files have been updated to include the hidden test set (or its video IDs): anet_detection_vg_fc6_feat_100rois.h5, anet_entities_prep.tar.gz, and anet_entities_captions.tar.gz.
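For reference, a script along the following lines can merge the downloaded hidden-test feature files into the existing directories. This is only a sketch under assumptions: the source directory names are placeholders for wherever you uncompressed the downloads, and the destinations assume the default data/anet layout.

import glob
import os
import shutil

# Hypothetical source folders -- replace with wherever you uncompressed the
# hidden-test downloads. Destinations assume the default data/anet layout.
moves = [
    ('hidden_test_region_feats', 'data/anet/fc6_feat_100rois'),
    ('hidden_test_frame_feats', 'data/anet/rgb_motion_1d'),
]
for src_dir, dst_dir in moves:
    for f in glob.glob(os.path.join(src_dir, '*.npy')):
        shutil.move(f, os.path.join(dst_dir, os.path.basename(f)))
        print('moved', os.path.basename(f), '->', dst_dir)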

Download the preprocessed annotation files from here, uncompress them, and place them under data/anet. Alternatively, you can reproduce them all using the data from the ActivityNet-Entities repo and the preprocessing script prepro_dic_anet.py under prepro. Then, download the ground-truth caption annotations (under our val/test splits) from here and place them under data/anet as well.

The region features and detections are available for download (feature and detection). The region feature file should be decompressed and placed under your feature directory; we refer to this region feature directory as feature_root in the code. The H5 region detection (proposal) file is referred to as proposal_h5 in the code. To extract features for a custom dataset (or, for the brave, to re-extract them for ANet-Entities), refer to the feature extraction tool here.

The frame-wise appearance (with suffix _resnet.npy) and motion (with suffix _bn.npy) feature files are available here. We refer to this directory as seg_feature_root.

Other auxiliary files, such as the weights from the Detectron fc7 layer, are available here. Uncompress and place them under the data directory.
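As a quick sanity check after data preparation, something like the following can confirm that the feature files and the proposal file are in place and readable. It is a sketch that assumes the default data/anet layout used elsewhere in this README; adjust the paths to your own feature_root, seg_feature_root, and proposal_h5.

import glob
import h5py
import numpy as np

feature_root = 'data/anet/fc6_feat_100rois'       # region features, per-segment .npy
seg_feature_root = 'data/anet/rgb_motion_1d'      # frame-wise *_resnet.npy / *_bn.npy
proposal_h5 = 'data/anet/anet_detection_vg_fc6_feat_100rois.h5'

print(len(glob.glob(feature_root + '/*.npy')), 'region feature files')
print(len(glob.glob(seg_feature_root + '/*_resnet.npy')), 'appearance feature files')
print(len(glob.glob(seg_feature_root + '/*_bn.npy')), 'motion feature files')

with h5py.File(proposal_h5, 'r') as h5:
    print('proposal datasets:', list(h5.keys()))

sample = sorted(glob.glob(feature_root + '/*.npy'))[0]
print('sample region feature shape:', np.load(sample).shape)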

Training and Validation

Modify the config file cfgs/anet_res101_vg_feat_10x100prop.yml with the correct dataset and feature paths (or point to them through symlinks). Link tools/anet_entities to your ANet-Entities dataset root location. Create new directories log and results under the root directory to save log and result files.
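Before launching a multi-GPU job, a small check like the one below can catch broken paths. It is a sketch that assumes the path options in the yml are named feature_root, seg_feature_root, and proposal_h5, as they are referred to in the Data Preparation section; adjust the key names if your config differs.

import os
import yaml

with open('cfgs/anet_res101_vg_feat_10x100prop.yml') as f:
    opts = yaml.safe_load(f)

# Assumed option names -- adjust if the yml uses different keys.
for key in ('feature_root', 'seg_feature_root', 'proposal_h5'):
    path = opts.get(key)
    status = 'exists' if path and os.path.exists(path) else 'MISSING'
    print(key, '->', path, status)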

An example command for running an 8-GPU data-parallel job:

For supervised models (with self-attention):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --path_opt cfgs/anet_res101_vg_feat_10x100prop.yml \
    --batch_size $batch_size --cuda --checkpoint_path save/$ID --id $ID --mGPUs \
    --language_eval --w_att2 $w_att2 --w_grd $w_grd --w_cls $w_cls --obj_interact | tee log/$ID

For unsupervised models (without self-attention):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --path_opt cfgs/anet_res101_vg_feat_10x100prop.yml \
    --batch_size $batch_size --cuda --checkpoint_path save/$ID --id $ID --mGPUs \
    --language_eval | tee log/$ID

Arguments: batch_size=240, w_att2=0.05, w_grd=0, w_cls=0.1; ID is the model name.

(Optional) Remove --mGPUs to run in single-GPU mode.

Pre-trained Models

The pre-trained models can be downloaded from here (1.5GB). Make sure you uncompress the file under the save directory (create one under the root directory if it does not exist).

Inference and Testing

For supervised models (ID=anet-sup-0.05-0-0.1-run1):

(standard inference: language evaluation and localization evaluation on generated sentences)

python main.py --path_opt cfgs/anet_res101_vg_feat_10x100prop.yml --batch_size 100 --cuda \
    --num_workers 6 --max_epoch 50 --inference_only --start_from save/$ID --id $ID \
    --val_split $val_split --densecap_references $dc_references --densecap_verbose --seq_length 20 \
    --language_eval --eval_obj_grounding --obj_interact \
    | tee log/eval-$val_split-$ID-beam$beam_size-standard-inference

(GT inference: localization evaluation on GT sentences)

python main.py --path_opt cfgs/anet_res101_vg_feat_10x100prop.yml --batch_size 100 --cuda \
    --num_workers 6 --max_epoch 50 --inference_only --start_from save/$ID --id $ID \
    --val_split $val_split --seq_length 40 --eval_obj_grounding_gt --obj_interact \
    --grd_reference $grd_reference | tee log/eval-$val_split-$ID-beam$beam_size-gt-inference

For unsupervised models (ID=anet-unsup-0-0-0-run1), simply remove the --obj_interact option.

Arguments: dc_references='./data/anet/anet_entities_val_1.json ./data/anet/anet_entities_val_2.json', grd_reference='tools/anet_entities/data/anet_entities_cleaned_class_thresh50_trainval.json', val_split='validation'. If you want to evaluate on the test splits, set val_split to 'testing' or 'hidden_test' and change dc_references (look for anet_entities_test_1.json and anet_entities_test_2.json; this only supports 'testing') and grd_reference (the skeleton files *testing*.json and *hidden_test*.json) accordingly. Then, submit the object localization output files under results to the eval server. Note that this eval server is for general purposes; the servers designed for the CVPR'20 challenge are instead here.

You need at least 9GB of free GPU memory for the evaluation.

Reference

Please acknowledge the following paper if you use the code:

@inproceedings{zhou2019grounded,
  title={Grounded Video Description},
  author={Zhou, Luowei and Kalantidis, Yannis and Chen, Xinlei and Corso, Jason J and Rohrbach, Marcus},
  booktitle={CVPR},
  year={2019}
}

Acknowledgement

We thank Jiasen Lu for his Neural Baby Talk repo. We thank Chih-Yao Ma for his helpful discussions.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Portions of the source code are based on the Neural Baby Talk project.


grounded-video-description's Issues

Error running main.py

Hello! When I ran main.py using your pre-trained model, the configuration and errors were as follows:
D:\Anaconda2\envs\py36\python.exe D:/grounded-video-description-master/main.py --path_opt cfgs/anet_res101_vg_feat_10x100prop.yml --batch_size 1 --cuda --num_workers 0 --max_epoch 1 --inference_only --start_from save/anet-sup-0.05-0-0.1-run3 --id anet-sup-0.05-0-0.1-run3 --val_split testing --densecap_references anet_entities_test_1.json anet_entities_test_2.json --densecap_verbose --seq_length 20 --language_eval --eval_obj_grounding --obj_interact
D:/grounded-video-description-master/main.py:525: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
options_yaml = yaml.load(handle)
DataLoader loading json file: data/anet/dic_anet.json
vocab size is 4905
DataLoader loading input file: data/anet/cap_anet_trainval.json
DataLoader loading grounding file: tools/anet_entities/data/anet_entities_cleaned_class_thresh50_trainval.json
DataLoader loading proposal file: data/anet/anet_detection_vg_fc6_feat_100rois.h5
assigned 0 segments to split training
DataLoader loading json file: data/anet/dic_anet.json
vocab size is 4905
DataLoader loading input file: data/anet/cap_anet_trainval.json
DataLoader loading grounding file: tools/anet_entities/data/anet_entities_cleaned_class_thresh50_trainval.json
DataLoader loading proposal file: data/anet/anet_detection_vg_fc6_feat_100rois.h5
assigned 0 segments to split testing
THCudaCheck FAIL file=C:/ProgramData/Miniconda3/conda-bld/pytorch_1524543037166/work/aten/src/THC/THCTensorRandom.cu line=25 error=30 : unknown error
Traceback (most recent call last):
File "D:/grounded-video-description-master/main.py", line 576, in
segs_feat = segs_feat.cuda()
RuntimeError: cuda runtime error (30) : unknown error at C:/ProgramData/Miniconda3/conda-bld/pytorch_1524543037166/work/aten/src/THC/THCTensorRandom.cu:25

Process finished with exit code 1

Looking forward to your reply!

error when running the Starter code with GPU

Hello, Luowei, thanks for your code, it is quite helpful! After I downloaded all the data and configured the conda environment, I ran into this problem when running the starter code with CUDA enabled:

428 classes have the associated lemma word!
Traceback (most recent call last):
File "main.py", line 697, in
lang_stats = eval(epoch, opt)
File "main.py", line 358, in eval
input_ppls, dummy, dummy, ppls_feat, dummy, sample_idx, pnt_mask, 'sample', eval_opt)
File "/home/xddz/anaconda3/envs/gvd_pytorch1.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/big_data/pgc/code/grounded-video-description/misc/model.py", line 233, in forward
seq, seqLogprobs, att2, sim_mat = self._sample(segs_feat, ppls, num, ppls_feat, sample_idx, pnt_mask, eval_opt)
File "/big_data/pgc/code/grounded-video-description/misc/model.py", line 510, in _sample
F.layer_norm(self.seg_info_embed(num[:, 3:7].float()), [self.seg_info_size])), dim=-1)
File "/home/xddz/anaconda3/envs/gvd_pytorch1.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/home/xddz/anaconda3/envs/gvd_pytorch1.1/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/xddz/anaconda3/envs/gvd_pytorch1.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/home/xddz/anaconda3/envs/gvd_pytorch1.1/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 92, in forward
return F.linear(input, self.weight, self.bias)
File "/home/xddz/anaconda3/envs/gvd_pytorch1.1/lib/python3.7/site-packages/torch/nn/functional.py", line 1406, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/THC/THCBlas.cu:259

When I run the code without GPU, it works well, so this really confuses me. My GPU is a 2080 Ti and my CUDA version is 10.1. Looking forward to your reply, thanks!

How to infer from new videos ?

Hi @LuoweiZhou,
My question is two-fold:
1)
I downloaded the pre-trained models and I tried first to run the example you provided for inference. I got this error:
IOError: [Errno 2] No such file or directory: u'data/anet/rgb_motion_1d/K6Tm5xHkJ5c_resnet.npy'
Below is the point where the error arises:
Loading the model save/anet-unsup-0-0-0-run1/model-best.pth...
Finetune param: ctx2pool_grd.0.weight
Finetune param: ctx2pool_grd.0.bias
Finetune param: vis_embed.0.weight

I verified that the file is missing, but I do not know how to get it (I saw a similar issue, but I could not proceed with the provided answer as it was unclear to me).
2) I am wondering how I can use your code to run inference on my own videos. Can you please guide me?
Thanks in advance

Steps to generate a caption from a video file

Hi, can someone tell me the steps to follow / the python script to run in order to generate the caption for a video file I have?
What inputs do I need to prepare and which function do I invoke? I don't need to run any evaluation script in this case, just the prediction.

evaluate

I'm receiving an error on this line in main.py:
from evaluate import ANETcaptions

Is this file missing, or does it come from a different source?

low performance

I trained the grounded video captioning model in my own experimental environment with the given hyper-parameters, datasets, features, proposals and code, but I get low CIDEr performance.
According to the paper, we should get a CIDEr above 45, but in fact I only get 18 after 15 epochs. I think the model has converged because CIDEr is stable on the validation set.
The difference between my results and the performance in the paper is too large, and this really bothers me. Are there any problems with how I am running this experiment?
Thanks for your time.

Why 100% teacher forcing?

Based on what I get from the code, the teacher forcing ratio is set to 100% in both training and validation modes. Shouldn't the teacher forcing ratio be set to 0 during validation? Otherwise, we are not really validating the model. Also, I don't understand why the teacher forcing should be 100% for training either; is it just because it makes the model perform better?

Regards,
Ali

Pre-trained model GVD on test-split

Hello All,

I am trying to reproduce the results on ActivityNet-Entities using the pre-trained model for the test split,
but I am getting the following results:

--------------------------------------------------------------------------------
The overall localization accuracy is nan

Number of videos in the reference: 0, number of videos in the submission: 1302
Number of groundable objects in this split: 0

The overall localization accuracy is nan

Results Summary (GT sent):
The averaged attention / grounding box accuracy across all classes is: nan / nan
The averaged classification accuracy across all classes is: 0.0000

Can someone please explain the reason for the 0 accuracy? Am I missing some parameter or file to pass?

Where is "caption_flickr30k.json"

Hi, I'm using the flickr30k branch to try this repo.
in the "misc/utils.py" file, line 349,
annFile = 'tools/coco-caption/annotations/caption_flickr30k.json'
When I test the environment, the error shows that I don't have the caption_flickr30k.json file. Can I ask where this file is?

Another question: in the "prepro/prepro_det.py" file, line 31,
det_file = json.load(open('data/flickr30k/flickr30k_detection.json'))
the error shows that I don't have data/flickr30k/flickr30k_detection.json.
BTW, in the same file, line 15, I think 'coco' should be changed to 'flickr30k'.

main.py error

When I run the following command, I come across this issue:

python main.py --batch_size 100 --cuda --num_workers 6 --max_epoch 50 --inference_only \
    --start_from save/anet-sup-0.05-0-0.1-run1 --id anet-sup-0.05-0-0.1-run1 \
    --seq_length 20 --language_eval --eval_obj_grounding --obj_interact
index: 420, similarity: 0.58, swimming, number
index: 424, similarity: 0.46, he, camera
Loading the model save/anet-sup-0.05-0-0.1-run1/model-best.pth...
Finetune param: vis_embed.0.weight
Finetune param: ctx2pool_grd.0.weight
Finetune param: ctx2pool_grd.0.bias
Use adam as optmization method
428 classes have the associated lemma word!
Traceback (most recent call last):
  File "main.py", line 696, in <module>
    lang_stats = eval(epoch, opt)
  File "main.py", line 333, in eval
    data = data_iter_val.next()
  File "/home/louyu/miniconda3/envs/torch11/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/home/louyu/miniconda3/envs/torch11/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
AssertionError: Traceback (most recent call last):
  File "/home/louyu/miniconda3/envs/torch11/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/louyu/miniconda3/envs/torch11/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/media/louyu/DATA/grounded-video-description/misc/dataloader_anet.py", line 193, in __getitem__
    assert(num_proposal == region_feature.shape[0])
AssertionError

I am using an RTX 2070, so I am on CUDA 10.2 and can't install CUDA 9. Any ideas about this issue?

Feature Extraction

Hi, I am confused about the feature extraction process.
How can I use this model on other videos?

Broken dependencies

This project seems abandoned. The last commit dates back 3 years and the dependencies are now very outdated and thus no longer resolvable.

Attention and Grounding Accuracy Metrics

Hi and thanks for this great work.

I am curious about the metrics you used to evaluate the attention and grounding performance. As stated in Table 2 of the GVD paper, the attention and grounding accuracies are the object localization accuracies for attention and grounding. Is "object localization" accuracy the same as an IoU ratio in this case? Can you please provide some links/figures for the actual definition of attention and grounding accuracy? To the best of my knowledge, there was not a clear one in the paper.

Regards,
Ali

Masked Transformer in this repo

Hi. Thanks for this amazing repository.

I was looking at the code for the masked transformer, and in the readme I found that the "captioning-only" model can be obtained by configuring --att_model in this implementation.

Could you clarify how one would go about doing that?

Uniform sampling script?

Hi Luowei, thanks for the great work and code!

I would really appreciate it if you could provide the uniform sampling script you use to take frames from the videos. I need to extract features directly from the frames, and I want my pipeline to be in accordance with what you have done. Thank you so much :)

Is there a process for using the pre-trained model on my own video?

Hello, thanks for sharing the great work!

I want to use the flickr pre-trained model on my own video, so I read some issues (#5 #12 #32),

but I couldn't find an official process.

  1. Could you tell me your process?
  2. If you have time, how about adding the process and the code you used to the readme?

How to perform captioning on action proposals?

Hi, thank you for sharing this repository.

My goal is to use your model to generate captions on Activity-Net action proposals.

The dataset is the same, so I don't think I need to retrain the model; however, I would need to generate the region features and detections using Detectron. Right?
Is there an easy way, a script, to do it?

I saw that you kindly provide the code here:
https://github.com/LuoweiZhou/detectron-vlp

Should I download the "RGB frames extracted at 5FPS" provided by ActivityNet, segment them by my action proposal timestamps, uniformly sample 10 frames for each segment, and then use the extract_features.py script in your detectron-vlp repository to extract the region features?

Thanks in advance.

I can't find the grd_reference file

In utils.py, where is the grd_reference file referred to by default='tools/anet_entities/data/anet_entities_cleaned_class_thresh50_trainval.json'?
I can't find where it is.
Thank you very much.

Image Data and Visualization

Hi!

I'm trying to visualize some of the results by using "--vis_attn", but I am unsure where the image_path data is stored. Additionally, are there any other visualization scripts you used to visualize results?

Much appreciated!

Jeff

About default loss weights

Thanks for open-sourcing the code.
The default weights w_att2=0.05, w_grd=0, w_cls=0.1 are provided in this project.
Does this mean that the grounding loss is not necessary in this project?

Some feature files are missing in fc6_feat_100rois and rgb_motion_1d

I wrote some code to test which files are missing, and I list them below. I am sure the files I downloaded are complete, so I hope you can check the uploaded files when you have free time.

DataLoader loading json file: data/anet/dic_anet.json
vocab size is 4905
DataLoader loading json file: data/anet/cap_anet_trainval.json
DataLoader loading json file: data/anet/anet_captions_all_splits.json
DataLoader loading proposal file: data/anet/anet_detection_vg_fc6_feat_100rois.h5
a file lossed: data/anet/fc6_feat_100rois/v_---9CpRcKoU_segment_00.npy
a file lossed: data/anet/fc6_feat_100rois/v_---9CpRcKoU_segment_01.npy
a file lossed: data/anet/fc6_feat_100rois/v_---9CpRcKoU_segment_02.npy
a file lossed: data/anet/fc6_feat_100rois/v_cxIfpBvuk0E_segment_08.npy
a file lossed: data/anet/fc6_feat_100rois/v_cxIfpBvuk0E_segment_09.npy
a file lossed: data/anet/fc6_feat_100rois/v_cxIfpBvuk0E_segment_10.npy
a file lossed: data/anet/fc6_feat_100rois/v_kBUDMFgWO9I_segment_03.npy
a file lossed: data/anet/rgb_motion_1d/xeOHoiH-dmo_bn.npy
a file lossed: data/anet/rgb_motion_1d/xeOHoiH-dmo_bn.npy
a file lossed: data/anet/rgb_motion_1d/xeOHoiH-dmo_bn.npy
a file lossed: data/anet/fc6_feat_100rois/v_tD30qafrkhM_segment_10.npy
a file lossed: data/anet/fc6_feat_100rois/v_tD30qafrkhM_segment_11.npy
a file lossed: data/anet/fc6_feat_100rois/v_tD30qafrkhM_segment_12.npy
a file lossed: data/anet/fc6_feat_100rois/v_tD30qafrkhM_segment_13.npy
a file lossed: data/anet/fc6_feat_100rois/v_vdYFwqfqgJA_segment_02.npy
a file lossed: data/anet/fc6_feat_100rois/v_sjyZWmvTGA4_segment_00.npy
a file lossed: data/anet/fc6_feat_100rois/v_sjyZWmvTGA4_segment_01.npy
a file lossed: data/anet/fc6_feat_100rois/v_sjyZWmvTGA4_segment_02.npy
a file lossed: data/anet/fc6_feat_100rois/v_sjyZWmvTGA4_segment_03.npy
a file lossed: data/anet/fc6_feat_100rois/v_sjyZWmvTGA4_segment_04.npy
a file lossed: data/anet/rgb_motion_1d/iVVatZsgnGo_bn.npy
a file lossed: data/anet/rgb_motion_1d/iVVatZsgnGo_bn.npy
a file lossed: data/anet/rgb_motion_1d/iVVatZsgnGo_bn.npy
a file lossed: data/anet/fc6_feat_100rois/v_E33xUgVqEH0_segment_01.npy
training videos: 10005
assigned 37397 segments to split training
DataLoader loading json file: data/anet/dic_anet.json
vocab size is 4905
DataLoader loading json file: data/anet/cap_anet_trainval.json
DataLoader loading json file: data/anet/anet_captions_all_splits.json
DataLoader loading proposal file: data/anet/anet_detection_vg_fc6_feat_100rois.h5
a file lossed: data/anet/rgb_motion_1d/j73Wh1olDsA_bn.npy
a file lossed: data/anet/rgb_motion_1d/j73Wh1olDsA_bn.npy
a file lossed: data/anet/rgb_motion_1d/j73Wh1olDsA_bn.npy
validation videos: 2459
assigned 8771 segments to split validation

Beam search broken

There are a few problems I found with beam search simply not working. The first is the inclusion of two extra parameters in the call to core:

rnn_output, state, att2_weight, att_h, _, _ = self.core(xt, beam_fc_feats, beam_conv_feats,

core is being called with two extra parameters, causing beam search to fail. The extra parameters appear to be Variable(beam_pool_feats.data.new(beam_size, rois_num).fill_(0)) and self. I've looked at the rest of the code but don't see any variant core models that would take these extra parameters, and self doesn't make sense regardless.

Even after removing those two variables (which isn't a great idea, as I assume there is a reason for the Variable), the beam search function itself returns an incorrect number of values:

return seq.t(), seqLogprobs.t(), att2.t()

It returns 3, whereas the normal _sample function returns 4:

return seq, seqLogprobs, att2_weights, sim_mat_static

I'd love to understand better how beam search should work, and how I'd go about fixing it. Thanks!

main.py

Hello, in main.py:
python main.py --batch_size 100 --cuda --num_workers 6 --max_epoch 50 --inference_only --start_from save/anet-sup-0.05-0-0.1-run1 --id anet-sup-0.05-0-0.1-run1 --seq_length 20 --language_eval --eval_obj_grounding --obj_interact

I also run into some problems (error screenshot omitted). It seems like something is wrong with the data; could you please give me some advice?
Thank you!

AssertionError

When I run the following command for inference, I get an AssertionError:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7  python main.py --path_opt cfgs/anet_res101_vg_feat_10x100prop.yml --batch_size 100 --cuda \
    --num_workers 6 --max_epoch 50 --inference_only --start_from save/anet-sup-0.05-0-0.1-run1 --id anet-sup-0.05-0-0.1-run1 \
    --val_split validation --densecap_references ./data/anet/anet_entities_val_1.json  --densecap_verbose --seq_length 20 \
    --language_eval --eval_obj_grounding --obj_interact \
    | tee log/eval-validation-anet-sup-0.05-0-0.1-run1-beam 5-standard-inference

(error screenshot omitted)
Does anyone else have the same error? How can I solve it? Thanks.

Import "evaluate" could not be resolved

In the main.py file, there are two lines, "from evaluate import ANETcaptions" and "from eval_grd_anet_entities import ANetGrdEval". "eval_grd_anet_entities.py" and "evaluate.py" don't exist in the root directory, so "ANETcaptions" and "ANetGrdEval" can't be imported. Did you miss these two files?

Viewing results

Hi,

Where can I see the sentences generated by the model?

I downloaded the pretrained models. Using the evaluation code, I can run the evaluation script. However, I don't see the caption results generated by the model. Which folder would they be in? Or are they not saved?

Thanks

Nikky

Pre-trained model for GVD

Hi. Thank you for the amazing repository. I see that the arguments for the pretrained models are
w_att2=0.05, w_grd=0, w_cls=0.1. If I understand correctly, this would correspond to the third-last row in Table 2 of your paper (https://arxiv.org/pdf/1812.06587.pdf). Is that correct?

If so, could you share the pretrained model for GVD (last row in Table 2 of your paper).

rgb_motion_1d features

Hi, I can't figure out these features.
The training, validation and test features, both the visual "_resnet.npy" and the motion "_bn.npy" ones, are the same as those used in densecap; same values and same checksums.

So I don't understand why the features you provided with the 04/15/2020 update for the hidden test set are different in shape and values from those used in densecap.

It's important for me to understand this because I have extended your model to tackle the dense video captioning challenge on ActivityNet, but I got a very low score compared to what I got on the validation set.

Why mask() compares targets against 1 rather than 0

Hi, I'm wondering why the function mask() in misc.transformer compares targets against 1. Since the padding value (both at the front and the end of the sentence) is 0, shouldn't it compare against 0? Thanks.

proposals vs. bboxs

Hi Luowei,

Thanks for sharing this wonderful codebase and dataset! I see there are both proposal and bounding box annotations provided. I'm wondering what the difference between these two is. Thanks!

In inference-only mode, how to split my own video into several segments to represent each event?

In the paper:

In video description, we augment the global feature with segment positional information (i.e., total number of segments, segment index, start time and end time), which is empirically important.

I am confused about where the segment information (start time and end time) comes from.

I have already extracted features from my own video and can generate a sentence, but it only generates one sentence per video. So, how do I split one video into several events? Or, how do I get the start and end time information to represent each event in a new video?

question about use pre-train model on my own video

Hello, thanks for sharing the great work, it is very helpful! I am using your pre-trained model to generate descriptions for my own video, and I use the code you offer to extract the features (_segment.npy, anet_detection_vg_fc6_feat_100rois.h5, plus bn.npy and resnet.npy). However, when I use them to generate a caption, it says 'TypeError: only size-1 arrays can be converted to Python scalars'. I found that the cause is the difference between the anet_detection_vg_fc6_feat_100rois.h5 you offer and the anet_detection_vg_fc6_feat_100rois.h5 file I generate with the code in detectron-vlp: the dimensions of dets_num, dets_labels and others in the detectron-vlp output are different from those in the .h5 file you offer. https://github.com/LuoweiZhou/detectron-vlp/blob/b9140d298538703205fd2c0421b06c4b40e00018/tools/extract_features_gvd_anet.py#L221
Looking forward to your reply, thanks!

Training Time

Thank you for the amazing repository. Could you share an estimate of how long the training process takes? Also, I noticed that during the training of a few batches, the GPU utilization is 0. Is there any particular reason for this? I haven't been able to pinpoint the cause on my side.

Language evaluation: segments in test_2 and val_2

I don't understand why, for the language part, you evaluate the model considering also the segments in the "anet_entities_val_2.json" and "anet_entities_test_2.json" files, while the generated captions only cover the segments in "anet_entities_val_1.json" and "anet_entities_test_1.json".
I can't make sense of evaluating a model on elements it didn't process, which lowers the performance scores.

Can you help me to understand this?

Thank you very much for your effort in this work.

conda environment

Hi, when I create the python3 env, I get the following issue. Could you help me solve this problem?

Solving environment: failed
ResolvePackageNotFound:
  - python==3.7.6=h0371630_1
