Self-Chained Image-Language Model for Video Localization and Question Answering

Code structure

# data & data preprocessing
./sevila_data

# pretrained checkpoints
./sevila_checkpoints

# SeViLA code
./lavis/

# running scripts for SeViLA localizer/answerer training/inference
./run_scripts
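
As a quick sanity check before running anything, the layout above can be verified with a small shell function. This is only a sketch: check_layout is a hypothetical helper that does not ship with the repository, and it assumes the four directories sit directly under the repository root.

```shell
# check_layout is a hypothetical helper, not part of the SeViLA repository.
# It reports which of the four top-level directories listed above are missing
# under the given repository root (default: current directory).
check_layout() {
    root="${1:-.}"
    missing=0
    for dir in sevila_data sevila_checkpoints lavis run_scripts; do
        if [ ! -d "$root/$dir" ]; then
            echo "missing: $root/$dir" >&2
            missing=1
        fi
    done
    # Exit status 0 if every directory exists, 1 otherwise.
    return "$missing"
}
```

Calling check_layout from the repository root (or check_layout /path/to/repo) prints one line per missing directory to stderr and returns a non-zero status if any are absent.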

Setup

Install Dependencies

  1. (Optional) Create a conda environment:
conda create -n sevila python=3.8
conda activate sevila
  2. Build from source:
pip install -e .

Download Pretrained Models

We pre-train the SeViLA localizer on QVHighlights and host the checkpoints on Hugging Face. Download the checkpoints and put them under ./sevila_checkpoints. The checkpoints (814.55M) contain the pre-trained localizer and the zero-shot answerer.
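
After downloading, it may help to confirm that the checkpoint directory is actually populated before launching the demo or any script. The function below is a minimal sketch: has_checkpoints is a hypothetical name, the default path assumes the ./sevila_checkpoints directory from the code structure section, and it only checks that some file is present, not that the file is a valid or complete checkpoint.

```shell
# has_checkpoints is a hypothetical helper, not part of the SeViLA repository.
# It succeeds if the directory (default: ./sevila_checkpoints) exists and
# contains at least one regular file. It does not validate file contents.
has_checkpoints() {
    ckpt_dir="${1:-./sevila_checkpoints}"
    [ -d "$ckpt_dir" ] || return 1
    # find prints nothing when the directory holds no regular files.
    [ -n "$(find "$ckpt_dir" -type f)" ]
}
```

For example, has_checkpoints || echo "no checkpoints found" prints a warning when the directory is missing or empty.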

Run Gradio Demo Locally

We also provide a Gradio UI for testing SeViLA locally. Running the demo locally requires about 12GB of memory.

  • Installing Gradio:
pip install gradio==3.30.0
  • Running the following command in a terminal will launch the demo:
python app.py

Dataset Preparation

We test our model on:

Please download the original QA data and preprocess it with our scripts.

Training and Inference

We provide SeViLA training and inference script examples as follows.

Please refer to the dataset page to customize your data path.

1) Localizer Pre-training

sh run_scripts/sevila/pre-train/pretrain_qvh.sh

2) Localizer Self-refinement

sh run_scripts/sevila/refinement/nextqa_sr.sh

3) Answerer Fine-tuning

sh run_scripts/sevila/finetune/nextqa_ft.sh

4) Inference

sh run_scripts/sevila/inference/nextqa_infer.sh
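
The four example scripts above can also be chained from one wrapper. The sketch below is not part of the repository: run_sevila_pipeline is a hypothetical function, and running the stages back to back in the listed order is an assumption. Each script may still need its data and checkpoint paths edited first.

```shell
# run_sevila_pipeline is a hypothetical wrapper, not part of the SeViLA repository.
# It runs the four example scripts in the order they are listed above, from the
# repository root passed as $1 (default: current directory), and stops at the
# first script that exits with a non-zero status.
run_sevila_pipeline() {
    root="${1:-.}"
    for script in \
        run_scripts/sevila/pre-train/pretrain_qvh.sh \
        run_scripts/sevila/refinement/nextqa_sr.sh \
        run_scripts/sevila/finetune/nextqa_ft.sh \
        run_scripts/sevila/inference/nextqa_infer.sh
    do
        echo "running $script"
        # Run in a subshell so the caller's working directory is unchanged.
        ( cd "$root" && sh "$script" ) || {
            echo "failed: $script" >&2
            return 1
        }
    done
}
```

Stopping at the first failure avoids starting a later stage after an earlier one has failed.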

Acknowledgments

We thank the developers of LAVIS, BLIP-2, CLIP, and All-in-One for their public code releases.

Reference

Please cite our paper if you use our models in your work:

@article{yu2023self,
  title={Self-Chained Image-Language Model for Video Localization and Question Answering},
  author={Yu, Shoubin and Cho, Jaemin and Yadav, Prateek and Bansal, Mohit},
  journal={arXiv preprint arXiv:2305.06988},
  year={2023}
}
