
dense-video-captioning-pytorch's Introduction

Dense Video Captioning

Code for the SYSU submission to ActivityNet Challenge 2020 (Task 2: Dense Video Captioning). Our approach follows a two-stage pipeline: first, we extract a set of temporal event proposals; then a multi-event captioning model captures event-level temporal relationships and fuses multi-modal information.
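
As a rough illustration of this two-stage pipeline, the sketch below shows the overall flow. All function names here are hypothetical placeholders, not the repo's actual API.

from typing import List, Tuple

def generate_event_proposals(video_features) -> List[Tuple[float, float]]:
    # Stage 1: temporal event proposals as (start, end) in seconds.
    # In this work the proposals come from DBG/ESGN (see Prerequisites);
    # dummy segments stand in here.
    return [(0.0, 12.5), (10.0, 30.2)]

def caption_events(video_features, proposals):
    # Stage 2: the multi-event captioning model attends over all proposals
    # jointly, so each caption can exploit event-level temporal relationships
    # and fused multi-modal features. Dummy captions stand in for model output.
    return ["a placeholder caption" for _ in proposals]

features = None  # e.g. TSN features loaded from ./data/resnet_bn
proposals = generate_event_proposals(features)
captions = caption_events(features, proposals)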

We won 2nd place; the technical paper is available on arXiv.

Environment

  1. Python 3.6.2
  2. CUDA 10.0, PyTorch 1.2.0 (other versions may work but have not been tested)
  3. Other dependencies: run pip install -r requirement.txt

Prerequisites

  • ActivityNet video features. We use TSN features following this repo. Follow its "Data Preparation" section to download the feature files, then decompress them and move them into ./data/resnet_bn.

  • Download the annotation files and pre-generated proposal files from Google Drive and place them in ./data. For proposal generation, please refer to DBG and ESGN.

  • Build the vocabulary file: run python misc/build_vocab.py.

  • (Optional) You can also test the code with C3D features. Download the C3D feature file (sub_activitynet_v1-3.c3d.hdf5) from here, convert the h5 file into per-video npy files, and place them into ./data/c3d (see the conversion sketch below).
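
A minimal conversion sketch in Python. It assumes the official C3D file layout (one HDF5 group per video id containing a 'c3d_features' dataset) and that h5py and numpy are installed; verify the keys of your downloaded file before relying on it.

import os
import h5py
import numpy as np

h5_path = 'sub_activitynet_v1-3.c3d.hdf5'
out_dir = './data/c3d'
os.makedirs(out_dir, exist_ok=True)

with h5py.File(h5_path, 'r') as f:
    for vid in f.keys():
        # assumed layout: f[video_id]['c3d_features'] -> (num_segments, feature_dim)
        feats = np.asarray(f[vid]['c3d_features'])
        np.save(os.path.join(out_dir, vid + '.npy'), feats)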

Usage

  • Training
# first, train the model with cross-entropy loss 
cfg_file_path=cfgs/tsrm_cmg_hrnn.yml
python train.py --cfg_path $cfg_file_path

# Afterward, train the model with reinforcement learning on the enlarged training set
cfg_file_path=cfgs/tsrm_cmg_hrnn_RL_enlarged_trainset.yml
python train.py --cfg_path $cfg_file_path

Training logs and generated captions are saved under ./save.
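
For reference, the RL stage of captioning models typically optimizes a sentence-level reward against a self-critical baseline. The sketch below is a generic self-critical sequence training (SCST) style loss, given only as an assumption of what such a stage commonly looks like; it is not taken from this repo's train.py and may differ from the exact objective used here.

import torch

def scst_loss(sample_logprobs, sample_reward, greedy_reward, mask):
    # sample_logprobs: (batch, seq_len) log-probs of tokens in sampled captions
    # sample_reward, greedy_reward: (batch,) sentence-level metric scores
    #   (e.g. METEOR/CIDEr) for sampled vs. greedy-decoded captions
    # mask: (batch, seq_len), 1 for real tokens, 0 for padding
    advantage = (sample_reward - greedy_reward).unsqueeze(1)  # (batch, 1)
    return -(advantage * sample_logprobs * mask).sum() / mask.sum()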

  • Evaluation
# evaluation with ground-truth proposals (small val set with 1000 videos)
result_folder=tsrm_cmg_hrnn_RL_enlarged_trainset
val_caption_file=data/captiondata/expand_trainset/val_1.json
python eval.py --eval_folder $result_folder --eval_caption_file $val_caption_file

# evaluation with learnt proposals (small val set with 1000 videos)
result_folder=tsrm_cmg_hrnn_RL_enlarged_trainset
val_caption_file=data/captiondata/expand_trainset/val_1.json
lnt_tap_json=data/generated_proposals/tsn_dbg_esgn_valset_num4717.json
python eval.py --eval_folder $result_folder --eval_caption_file $val_caption_file --load_tap_json $lnt_tap_json

# evaluation with ground-truth proposals (standard val set with 4917 videos)
result_folder=tsrm_cmg_hrnn
python eval.py --eval_folder $result_folder

# evaluation with learnt proposals (standard val set with 4917 videos)
result_folder=tsrm_cmg_hrnn
lnt_tap_json=data/generated_proposals/tsn_dbg_esgn_valset_num4717.json
python eval.py --eval_folder $result_folder --load_tap_json $lnt_tap_json
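
When evaluating with learnt proposals, captions are scored only against ground-truth events that the proposals sufficiently overlap, measured by temporal IoU at the thresholds 0.3/0.5/0.7/0.9 (the tiou values reported by eval.py). Below is a minimal sketch of that overlap measure using the standard interval-IoU definition; the actual scoring is done by the evaluation toolkit that eval.py calls, whose exact formula may differ slightly.

def temporal_iou(pred, gt):
    # pred, gt: (start, end) timestamps in seconds
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
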
  • Testing
python eval.py --eval_folder tsrm_cmg_hrnn_RL_enlarged_trainset \
 --load_tap_json data/generated_proposals/tsn_dbg_esgn_testset_num5044.json \
 --eval_caption_file data/captiondata/fake_test_anno.json
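
The generated captions end up as JSON under ./save/<eval_folder>. The snippet below is a small inspection sketch; it assumes the file follows the standard ActivityNet dense-captioning format ({"results": {video_id: [{"sentence": ..., "timestamp": [start, end]}, ...]}}), and the filename used here is hypothetical.

import json

# hypothetical path/filename; check ./save/<eval_folder> for the actual output file
result_file = './save/tsrm_cmg_hrnn_RL_enlarged_trainset/densecap_result.json'
with open(result_file) as f:
    results = json.load(f)['results']

# print the captions and timestamps of the first few videos
for vid, events in list(results.items())[:3]:
    for e in events:
        start, end = e['timestamp']
        print(f'{vid} [{start:.1f}s - {end:.1f}s]: {e["sentence"]}')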

We also provide config files for several baseline models; see ./cfgs for details.

Pre-trained model

We provide a pre-trained model, available from here. Download model-best-RL.pth and info.json, place them into ./save/tsrm_cmg_hrnn_RL_enlarged_trainset, and run the commands above for fast evaluation. On the small validation set (1000 videos), this model achieves METEOR scores of 14.51/10.14 with ground-truth/learnt proposals, respectively.

Related project

PDVC (ICCV 2021): A simple yet effective dense video captioning method that integrates proposal generation and caption generation into a parallel decoding framework.

Citation

If you find this repo helpful to your research, please consider citing:

@article{wang2020dense,
  title={Dense-Captioning Events in Videos: SYSU Submission to ActivityNet Challenge 2020},
  author={Wang, Teng and Zheng, Huicheng and Yu, Mingjing},
  journal={arXiv preprint arXiv:2006.11693},
  year={2020}
}

dense-video-captioning-pytorch's Issues

evaluation scores are low

Hi, I have a question about the results in the paper. When I tried to replicate them, my results were lower than those shown in the paper. My assumption is that the author multiplied what he got by 1000. Can anyone help me clarify this?
Thank you in advance.

My result:
{'Bleu_1': 0.05203881419889433, 'Bleu_2': 0.02532179465691948, 'Bleu_3': 0.009912868169621145, 'Bleu_4': 0.002988311654355243, 'METEOR': 0.02982898949811799, 'ROUGE_L': 0.05281023682433468, 'CIDEr': 0.054126791927464786, 'Recall': 0.2033760423022168, 'Precision': 0.2033760423022168, 'tiou': 0.6}

How to preprocess data for your network?

Hey,
I would like to modify your code for my own dataset, which already has generated custom captions.
Each clip shows one person performing one of nine actions. The captions describe the gender, the action, and the clothes the person is wearing.
There are 100 different captions for each video, stored as strings in a separate JSON file.
So far the videos are stored as per-frame PNGs, with JSON files describing the visible clothes and the gender of the person in each frame.

Can you help me with how I should prepare my dataset and your code so that I can use it as input for your video-captioning net?

Thank you in advance.

Training using c3d features

Hi, thank you for the amazing work. I want to check whether the current code supports training with the official C3D features, which were extracted every 8 frames.
Thank you.

Evaluation Error

Hi,
I tried to evaluate with this line from your README:
python eval.py --eval_folder $result_folder --eval_caption_file $val_caption_file --load_tap_json $lnt_tap_json
but I get an error from pycocoevalcap.
The error I get is:
score, scores = scorer.compute_score(gts[vid_id], res[vid_id])
File "./densevid_eval3/coco-caption/pycocoevalcap/meteor/meteor.py", line 44, in compute_score
score = float(self.meteor_p.stdout.readline().strip())
ValueError: could not convert string to float:

I tried to debug using a print command. score contains:
[b'', b'', b'', b'', b'']

Have you ever seen this error?

Thank you in advance.

How to use the pre-trained model with an .mp4 file?

Hello! Really great work!
Could you please help me? I'm trying to use your pre-trained model with my custom videos, but I don't understand how to use it with .mp4 video files.

Thanks in advance!

Training duration/batch size

Hi. First of all, thanks for publishing your implementation code!

I'm fairly inexperienced in training dense video captioning models, so I have some questions.

  1. You mention first training the model with tsrm_cmg_hrnn.yml and afterwards with tsrm_cmg_hrnn_RL_enlarged_trainset.yml. Is this also how you trained the model you submitted to the challenge? If not, how did you do it?

  2. Since both models contain an HRNN, I can only train them with a batch size of 1, right? That means I am only using a tiny fraction of my GPU, which seems undesirable.
    However, one epoch finishes quite quickly, within a few minutes. Compared to the model you linked, where one epoch takes around 20 to 30 minutes for me, this is much faster. Why is that?

In general, I would like to know what training duration and how many epochs to expect for a given batch size.

PS: These questions should probably be a personal email to you rather than an issue, right? :)

evaluation doesn't work

Hi, I have a question about the evaluation.

I ran the testing step and didn't get any errors, but every time the result I get is:

'METEOR', [0.0, 0.0, 0.0, 0.0]), ('Recall', [0.0, 0.0, 0.0, 0.0]), ('Precision', [0.0, 0.0, 0.0, 0.0]), ('tiou', [0.3, 0.5, 0.7, 0.9])

Can you help me understand what I am doing wrong?

Thank you in advance.
