

HowToCaption

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
arXiv

Download:

HowToCaption dataset:

Vicuna-13B-based captions:
The HowToCaption dataset used in the paper can be found here (~1.5GB)
The unfiltered HowToCaption dataset with the corresponding BLIP-based scores can be found here (~4.6GB)

HowToCaption-grounded dataset:

MiniGPT-4-based captions:
The HowToCaption-grounded dataset can be found here (~1.5GB)
The unfiltered HowToCaption-grounded dataset with the corresponding BLIP-based scores can be found here (~4.5GB)

How To Use

How to use the filtered HowToCaption or HowToCaption-grounded datasets:

  • Each file is a dictionary with video ids as keys
  • For each video we provide 'start', 'end', and 'text' lists of the same length
  • 'start' and 'end' hold the start and end seconds of each clip in the video

To note:

  • 'text' is a list of lists of strings, since several captions can correspond to the same position in the video
  • The seconds in the 'start' list are not sorted; however, each 'end' value corresponds to the 'start' value at the same position (see the loading sketch after the example below)

Example:

>>> HowToCaption['---39MFGZ-k']

{
'start': [12, 19, 29, 25, 55, 81, 82],
'end': [20, 27, 37, 33, 63, 89, 90],
'text': [
['Show how to unload a 12-gauge shotgun'],
['Loading a 12-gauge shotgun'],
['Demonstrating how to unload a 12-gauge shotgun', 'A better way to unload a gun'],
['Putting another round into the gun', 'The danger of loading a gun the usual way'],
['Loading the gun safely', 'Short stroke to load the gun', 'Loading the gun today'],
['Lifting up the bar to extract rounds'],
['Going forward and lifting up the bar to extract rounds']
]
}
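
For instance, here is a minimal loading sketch in Python. It assumes the download is a pickled dictionary saved as howtocaption.pickle; both the file format and the filename are assumptions, so adjust them to the actual file:

import pickle

# Assumption: the downloaded file is a pickled dict keyed by video id.
with open('howtocaption.pickle', 'rb') as f:
    howtocaption = pickle.load(f)

clips = howtocaption['---39MFGZ-k']
# 'start' is not sorted, so order the clip indices by start second first.
order = sorted(range(len(clips['start'])), key=lambda i: clips['start'][i])
for i in order:
    start, end, captions = clips['start'][i], clips['end'][i], clips['text'][i]
    for caption in captions:
        print(f"[{start:4d}s-{end:4d}s] {caption}")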

How to use the unfiltered HowToCaption or HowToCaption-grounded datasets:

The difference from the standard HowToCaption dataset is that 'text' is a list of lists of (string, score) tuples; a score-based filtering sketch follows the example below.

Example:

>>> HowToCaption['---39MFGZ-k']

{
'start': [12, 19, 25, 29, 55, 54, 65, 81, 82, 105, 103], 
'end': [20, 27, 33, 37, 63, 62, 73, 89, 90, 113, 111], 
'text': [
[('Show how to unload a 12-gauge shotgun', 0.5699871778488159)], 
[('Loading a 12-gauge shotgun', 0.5876383185386658)], 
[('Unloading and removing a round from the chamber', 0.31276029348373413), ('Putting another round into the gun', 0.4805337190628052), ('The danger of loading a gun the usual way', 0.4611629843711853)],
[('Demonstrating how to unload a 12-gauge shotgun', 0.617999255657196), ('A better way to unload a gun', 0.5126216411590576)], 
[('Loading the gun safely', 0.539146363735199), ('Short stroke to load the gun', 0.5076732635498047), ('Loading the gun today', 0.4759426712989807)], 
[('Being nervous on camera', 0.3465729355812073), ('Nervousness on camera', 0.27738460898399353)], 
[('Extracting rounds by lifting up the bar', 0.41076189279556274)], 
[('Lifting up the bar to extract rounds', 0.4220432639122009)], 
[('Going forward and lifting up the bar to extract rounds', 0.42620745301246643)], 
[('A person is speaking and pointing out that there are no ramps present', 0.30187565088272095)], 
[('The speaker mentions that they can be found online', 0.30197498202323914), ('The speaker concludes the video by saying "WWE" and ending the video', 0.36031144857406616)]]
}
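
As an illustration, here is a minimal sketch of score-based filtering over the unfiltered annotations. The 0.4 threshold and the filename are illustrative assumptions, not values from the paper:

import pickle

SCORE_THRESHOLD = 0.4  # assumed example value, not the paper's setting

# Assumption: the unfiltered file is also a pickled dict; the filename is illustrative.
with open('howtocaption_unfiltered.pickle', 'rb') as f:
    unfiltered = pickle.load(f)

filtered = {}
for video_id, clips in unfiltered.items():
    starts, ends, texts = [], [], []
    for start, end, captions in zip(clips['start'], clips['end'], clips['text']):
        kept = [text for text, score in captions if score >= SCORE_THRESHOLD]
        if kept:  # keep a clip only if at least one caption passes the threshold
            starts.append(start)
            ends.append(end)
            texts.append(kept)
    filtered[video_id] = {'start': starts, 'end': ends, 'text': texts}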

Acknowledgement

  • BLIP is the model used as the text-video encoder and the score function
  • Vicuna is an open-source instruction-tuned LLM used to generate the HowToCaption dataset
  • MiniGPT-4 is an open-source LLM with image conditioning used to generate the HowToCaption-grounded dataset

If you use the HowToCaption or HowToCaption-grounded dataset in your research or applications, please cite it using this BibTeX:

@article{shvetsova2023howtocaption,
  title={HowToCaption: Prompting LLMs to Transform Video Annotations at Scale},
  author={Shvetsova, Nina and Kukleva, Anna and Hong, Xudong and Rupprecht, Christian and Schiele, Bernt and Kuehne, Hilde},
  journal={arXiv preprint arXiv:2310.04900},
  year={2023}
}

License:

HowToCaption and HowToCaption-grounded are based on Vicuna and MiniGPT-4, which are fine-tuned versions of LLaMA, and should therefore be used under LLaMA's model license.
This repository is released under the Apache License.

