VSTAR

This is the official implementation of the ACL 2023 paper "VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions".

Dataset

Schedule

  • release dialogues
  • release feature (resnet, rcnn)
  • release test data (2024.07.17)
  • release meta data (genres, keywords, storyline, characters: name, avatar)
  • release frames

Dialogues

Supported languages: English, Simplified Chinese

  • Downloads

Storage: Train (196M); Valid (11.6M); Test (24M)

Links: BaiduNetDisk or GoogleDrive

  • Statistics
Split   Clips     Dialogues   Scenes/clip   Topics/clip
Train   172,041   4,319,381   2.42          3.68
Val     9,753     250,311     2.64          4.29
Test    9,779     250,436     2.56          4.12
  • Format
{
	"dialogs":[
		{
			"clip_id": "Friends_S01E01_clip_000",
			"dialog": ["hi", ...],
			"scene": [1, 1, 1, 1, 1, 1, 2, 2, ...],
			"session": [1, 1, 1, 2, 2, 2, 3, 3, ...]
		},
		...
	]
}
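
As a minimal sketch of reading this format, the snippet below loads a dialogue file and counts the scene and topic segments in each clip. It assumes, without confirmation from the repository, that the `scene` and `session` lists hold one segment id per utterance (aligned with `dialog`) and that an id change marks a transition. The helper names are illustrative and the toy values are made up.

```python
import json
from itertools import groupby


def count_segments(labels):
    """Count runs of identical consecutive ids, e.g. [1, 1, 2, 2, 3] -> 3."""
    return sum(1 for _ in groupby(labels))


def summarize_clip(clip):
    """Return per-clip counts; assumes one scene/session id per utterance."""
    return {
        "clip_id": clip["clip_id"],
        "utterances": len(clip["dialog"]),
        "scenes": count_segments(clip["scene"]),
        "topics": count_segments(clip["session"]),
    }


def load_dialogs(path):
    """Load the top-level "dialogs" list from a JSON file in the layout above."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["dialogs"]


if __name__ == "__main__":
    # Toy clip in the documented layout (values invented for illustration).
    toy = {
        "clip_id": "Friends_S01E01_clip_000",
        "dialog": ["hi", "hello", "how are you", "fine", "bye"],
        "scene": [1, 1, 1, 2, 2],
        "session": [1, 1, 2, 2, 3],
    }
    print(summarize_clip(toy))
```

For a downloaded split, `load_dialogs("train.json")` would return the list of clips, each of which can be passed to `summarize_clip`.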

Feature

  • Downloads

Storage: RCNN(246.2G), RESNET(109G)

Links: BaiduNetDisk

  • Format

File Structure:

# [name of TV show]_S[season]E[episode]_clip_[clip id].npy
├── Friends_S01E01
│   ├── Friends_S01E01_clip_000.npy
│   ├── Friends_S01E01_clip_001.npy
│   └── ...
├── ...
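
The naming scheme can be handled programmatically. The sketch below parses a clip name and builds the path of its feature file. The regular expression is inferred from the single example name `Friends_S01E01_clip_000.npy`, and `parse_clip_name`, `feature_path`, and the `features` root directory are hypothetical names, not part of the repository.

```python
import re
from pathlib import Path
from typing import NamedTuple

# Pattern inferred from the example "Friends_S01E01_clip_000.npy" (an assumption).
CLIP_RE = re.compile(
    r"^(?P<show>.+)_S(?P<season>\d+)E(?P<episode>\d+)_clip_(?P<clip>\d+)(?:\.npy)?$"
)


class ClipName(NamedTuple):
    show: str
    season: int
    episode: int
    clip: int


def parse_clip_name(name):
    """Split a clip id or file name into show, season, episode, and clip index."""
    m = CLIP_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized clip name: {name!r}")
    return ClipName(m["show"], int(m["season"]), int(m["episode"]), int(m["clip"]))


def feature_path(root, clip_id):
    """Build root/<show>_S<season>E<episode>/<clip id>.npy for a clip id."""
    stem = clip_id[:-4] if clip_id.endswith(".npy") else clip_id
    parse_clip_name(stem)  # validate before building the path
    episode_dir = stem.rsplit("_clip_", 1)[0]
    return Path(root) / episode_dir / f"{stem}.npy"


if __name__ == "__main__":
    print(parse_clip_name("Friends_S01E01_clip_000.npy"))
    print(feature_path("features", "Friends_S01E01_clip_000"))
```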

ResNet:

# numpy.load("Friends_S01E01_clip_000.npy")
(num_of_frames * 1000)

RCNN:

# numpy.load("Friends_S01E01_clip_000.npy", allow_pickle=True).item()
{
	"feature": (9 * num_of_frames * 2048), # array(float32), features of the top 9 objects
	"size": (num_of_frames * 2), # list(int), size of the original frame
	"box": (9 * num_of_frames * 4), # array(float32), bounding boxes
	"obj_id": (9 * num_of_frames), # list(int), object ids
	"obj_conf": (9 * num_of_frames), # array(float32), object confidence
	"obj_num": (num_of_frames) # list(int), number of objects per frame
}
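
A small sketch for inspecting the two feature files with NumPy follows. It relies only on the `numpy.load` calls shown above; the function names are illustrative. Because the exact axis order of the arrays is not spelled out here, `describe` simply reports the shapes it finds rather than assuming one.

```python
import numpy as np


def load_resnet(path):
    """Load a ResNet feature file, documented above as (num_of_frames * 1000)."""
    return np.load(path)


def load_rcnn(path):
    """Load an RCNN feature file: a dict pickled inside a 0-d object array."""
    # allow_pickle=True unpickles arbitrary objects, so only load trusted files.
    return np.load(path, allow_pickle=True).item()


def describe(features):
    """Map each key to the shape of its value (works for lists and arrays)."""
    return {key: np.shape(value) for key, value in features.items()}
```

For example, `describe(load_rcnn("Friends_S01E01_clip_000.npy"))` lists the shape stored under each key, which is a quick way to check a download against the layout above.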
  • Feature Extraction Tools

Please refer to OpenViDial_extract_features.

Installation

pip install -r requirements.txt

Scene Segmentation

  • Preprocess

Move train.json, valid.json, and test.json to the inputs/full directory.

Run the following script to convert the original annotations to the binary format that our baseline expects (see our paper for details):

cd inputs/full
python preprocess.py
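
The exact output of preprocess.py is not described here, so the following is only an illustration of one common way to turn per-utterance segment ids into binary labels: an utterance gets 1 when its id differs from the previous utterance's id (a new segment starts) and 0 otherwise. Both this convention and the function name `to_boundary_labels` are assumptions; the actual script may differ.

```python
def to_boundary_labels(segment_ids):
    """Return 1 where a new segment starts (after the first utterance), else 0."""
    labels = [0] * len(segment_ids)
    for i in range(1, len(segment_ids)):
        if segment_ids[i] != segment_ids[i - 1]:
            labels[i] = 1
    return labels


if __name__ == "__main__":
    # Segment ids in the style of the "scene" field of the dialogue format.
    print(to_boundary_labels([1, 1, 1, 2, 2, 2, 3, 3]))
```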
  • Train
python train_seg.py \
	--video 1 \
	--exp_set EXP_LOG \
	--train_batch_size 4
  • Infer
python generate_seg.py \
	--ckptid SAVED_CKPT_ID \
	--gpuid 0 \
	--exp_set EXP_LOG \
	--video 1

Topic Segmentation

  • Train
python train_seg.py \
	--video 0 \
	--exp_set EXP_LOG \
	--train_batch_size 4
  • Infer
python generate_seg.py \
	--ckptid SAVED_CKPT_ID \
	--gpuid 0 \
	--exp_set EXP_LOG \
	--video 0
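
In the commands above, scene and topic segmentation use the same scripts and differ only in the `--video` flag (1 for scene, 0 for topic). A hypothetical Python helper that assembles both command lines from a task name might look like the sketch below. It uses only the flags and placeholder values shown in this README; the function names and the `VIDEO_FLAG` mapping are illustrative.

```python
import shlex

# Mapping taken from the commands above: --video 1 for scene, --video 0 for topic.
VIDEO_FLAG = {"scene": "1", "topic": "0"}


def train_seg_command(task, exp_set, batch_size=4):
    """Build the argument list for train_seg.py for the given task."""
    return [
        "python", "train_seg.py",
        "--video", VIDEO_FLAG[task],
        "--exp_set", exp_set,
        "--train_batch_size", str(batch_size),
    ]


def generate_seg_command(task, exp_set, ckptid, gpuid=0):
    """Build the argument list for generate_seg.py for the given task."""
    return [
        "python", "generate_seg.py",
        "--ckptid", str(ckptid),
        "--gpuid", str(gpuid),
        "--exp_set", exp_set,
        "--video", VIDEO_FLAG[task],
    ]


if __name__ == "__main__":
    for task in ("scene", "topic"):
        print(shlex.join(train_seg_command(task, "EXP_LOG")))
        print(shlex.join(generate_seg_command(task, "EXP_LOG", "SAVED_CKPT_ID")))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the run.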

Dialogue Generation

To use coco_caption for evaluation, run the following script to generate the reference file:

cd inputs/full
python coco_caption_reformat.py

For evaluation details, please refer to https://github.com/tylin/coco-caption.

  • Train
python train_gen.py \
	--train_batch_size 4 \
	--model bart \
	--exp_set EXP_LOG \
	--video 1 \
	--fea_type resnet

  • Infer
python generate.py \
	--ckptid SAVED_CKPT_ID \
	--gpuid 0 \
	--exp_set EXP_LOG \
	--video 1 \
	--sess 1 \
	--batch_size 4

Citation

@misc{wang2023vstar,
    title={VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions},
    author={Yuxuan Wang and Zilong Zheng and Xueliang Zhao and Jinpeng Li and Yueqian Wang and Dongyan Zhao},
    year={2023},
    eprint={2305.18756},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
