
Video-Bench

If you like our project, please give us a star ⭐ on GitHub for the latest updates.


  • We introduce Video-Bench, the first comprehensive evaluation benchmark for Video-LLMs, featuring a three-level ability assessment that systematically evaluates models in video-exclusive understanding, prior knowledge incorporation, and video-based decision-making abilities.
  • We provide a user-friendly evaluation toolkit. Accompanied by our datasets and QA pairs, the toolkit can streamline the performance assessment of Video-LLMs.
  • We conduct extensive experiments to evaluate prominent Video-LLMs, summarizing their behaviors, analyzing main causes for observed limitations, and proposing future directions for improvement.
πŸ’‘ I also have other video-language projects that may interest you ✨.

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, Li Yuan
github arXiv

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan
github arXiv

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, Li Yuan
github arXiv

πŸ“° News

[2023.12.31] We have updated the Video-Bench Leaderboard with results for the Sphinx-V2 model, which significantly surpasses other Video-LLMs!

[2023.11.27] Video-Bench is released! Data and evaluation code are available now.

πŸ“£ Leaderboard

Welcome to the Video-Bench Leaderboard!

🚩🚩🚩 We are delighted to have witnessed the remarkable advancements in video understanding and artificial intelligence alongside the community over the past year. We are proud to announce the launch of Video-Bench, a platform designed to assist developers and users in the field of video analysis.

πŸ”₯πŸ”₯πŸ”₯ Video-Bench is committed to promoting the progress of video understanding models and facilitating their evaluation. We are pleased to announce the inaugural Video-Bench Leaderboard, which systematically evaluates the performance of video understanding models across various capabilities, including Video-exclusive Understanding, Prior Knowledge-based QA, and Comprehension and Decision-making. The leaderboard will feature rankings for open-source models, providing an inclusive and comprehensive reference for the industry and research community. We invite developers and researchers working on video understanding models to join Video-Bench and showcase their models' strengths in different domains.

πŸ‘‹πŸ‘‹πŸ‘‹ We also welcome valuable suggestions and contributions from the community to foster collaborative growth and advancement in video understanding models. If you have any questions or would like to get involved, please feel free to contact us. We look forward to continued progress in video understanding and artificial intelligence together with the community!

πŸ€— Evaluation

  1. Clone this repository and navigate to the Video-Bench folder
git clone https://github.com/PKU-YuanGroup/Video-Bench.git
cd Video-Bench
  2. Install additional packages
pip install -r requirements.txt

πŸ“‚ Data Preparation

The video data can be easily downloaded from Hugging Face.
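
For a quick start, a minimal download sketch using huggingface_hub is shown below. The repo id and target folder are assumptions (not confirmed by this README), so check the project's Hugging Face page for the exact dataset location.

# A hedged sketch using huggingface_hub; verify the dataset repo id on the project page.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="LanguageBind/Video-Bench",  # assumed repo id; confirm before use
    repo_type="dataset",
    local_dir="./Eval_video",            # any local folder; pass it later as --Eval_Video_root
)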

πŸ—οΈ Evaluate your own model

The code below is a generalized framework for dataset evaluation; you will need to adapt the model-loading and inference parts to your own model. Once the script finishes, you will find JSON files named ./Chat_results/{dataset_name}_eval.json.

Step 1: Chat with your own model to obtain conversation results.

import argparse
import os
import json
import traceback

parser = argparse.ArgumentParser()
parser.add_argument("--dataset_name", type=str, default=None, help="The type of LLM")
parser.add_argument("--Eval_QA_root", type=str, default='./', help="folder containing QA JSON files")
parser.add_argument("--Eval_Video_root", type=str, default='./', help="folder containing video data")
parser.add_argument("--chat_conversation_output_folder", type=str, default='./Chat_results', help="")
args = parser.parse_args()

Eval_QA_root = args.Eval_QA_root
Eval_Video_root = args.Eval_Video_root
dataset_qajson = {
  "Ucfcrime": f"{Eval_QA_root}/Eval_QA/Ucfcrime_QA_new.json",
  "Youcook2": f"{Eval_QA_root}/Eval_QA/Youcook2_QA_new.json",
  "TVQA": f"{Eval_QA_root}/Eval_QA/TVQA_QA_new.json",
  "MSVD": f"{Eval_QA_root}/Eval_QA/MSVD_QA_new.json",
  "MSRVTT": f"{Eval_QA_root}/Eval_QA/MSRVTT_QA_new.json",
  "Driving-decision-making": f"{Eval_QA_root}/Eval_QA/Driving-decision-making_QA_new.json",
  "NBA": f"{Eval_QA_root}/Eval_QA/NBA_QA_new.json",
  "SQA3D": f"{Eval_QA_root}/Eval_QA/SQA3D_QA_new.json",
  "Driving-exam": f"{Eval_QA_root}/Eval_QA/Driving-exam_QA_new.json",
  "MV": f"{Eval_QA_root}/Eval_QA/MV_QA_new.json",
  "MOT": f"{Eval_QA_root}/Eval_QA/MOT_QA_new.json",
  "ActivityNet": f"{Eval_QA_root}/Eval_QA/ActivityNet_QA_new.json",
  "TGIF": f"{Eval_QA_root}/Eval_QA/TGIF_QA_new.json"
}

if args.dataset_name is None:
    dataset_name_list = list(dataset_qajson.keys())
else:
    dataset_name_list = [args.dataset_name]
    print(f'Specifically run {args.dataset_name}')
print(dataset_name_list)

os.makedirs(args.chat_conversation_output_folder, exist_ok=True)

for dataset_name in dataset_name_list:
    qa_json = dataset_qajson[dataset_name]
    print(f'Dataset name:{dataset_name}, {qa_json=}!')
    with open(qa_json, 'r', encoding='utf-8') as f:
        data = json.load(f)
        
    eval_dict = {}
    for idx, (q_id, item) in enumerate(data.items()):
        try:   
            video_id = item['video_id']
            question = item['question'] 
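            # Append the lettered choices to the question so the model is prompted to answer with an option letter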
            if len(item['choices']) == 6:
                question += f"Choices: A.{item['choices']['A']} B.{item['choices']['B']} C.{item['choices']['C']} D.{item['choices']['D']} E.{item['choices']['E']} F.{item['choices']['F']} \n Among the six options A, B, C, D, E, F above, the one closest to the correct answer is:"
                candidates = ['A', 'B', 'C', 'D', 'E', 'F']
                candidates_long = [f" A.{item['choices']['A']}", f"B.{item['choices']['B']}", f"C.{item['choices']['C']}", f"D.{item['choices']['D']}", f"E.{item['choices']['E']}", f"F.{item['choices']['F']}"]
            elif len(item['choices']) == 5:
                question += f" A.{item['choices']['A']} B.{item['choices']['B']} C.{item['choices']['C']} D.{item['choices']['D']} E.{item['choices']['E']} \n Among the five options A, B, C, D, E above, the one closest to the correct answer is: "
                candidates = ['A', 'B', 'C', 'D', 'E']
                candidates_long = [f" A.{item['choices']['A']}", f"B.{item['choices']['B']}", f"C.{item['choices']['C']}", f"D.{item['choices']['D']}", f"E.{item['choices']['E']}"]
            elif len(item['choices']) == 4:
                question += f" A.{item['choices']['A']} B.{item['choices']['B']} C.{item['choices']['C']} D.{item['choices']['D']} \n Among the four options A, B, C, D above, the one closest to the correct answer is:"
                candidates = ['A', 'B', 'C', 'D']
                candidates_long = [f" A.{item['choices']['A']}", f"B.{item['choices']['B']}", f"C.{item['choices']['C']}", f"D.{item['choices']['D']}"]
            elif len(item['choices']) == 3:
                question += f" A.{item['choices']['A']} B.{item['choices']['B']} C.{item['choices']['C']} \n Among the three options A, B, C above, the one closest to the correct answer is: "
                candidates = ['A', 'B', 'C']
                candidates_long = [f" A.{item['choices']['A']}", f"B.{item['choices']['B']}", f"C.{item['choices']['C']}"]
            elif len(item['choices']) == 2:
                question += f" A.{item['choices']['A']} B.{item['choices']['B']} \n Among the two options A, B above, the one closest to the correct answer is: "
                candidates = ['A', 'B']
                candidates_long = [f" A.{item['choices']['A']}", f"B.{item['choices']['B']}"]
            vid_rela_path = item['vid_path']
            vid_path = os.path.join(Eval_Video_root, vid_rela_path)


            #=================================You need to change this code =========================
            # ......
            output, output_scores = ask(args, question, model, tokenizer, image_processor, vid_path)
            # ......
            #=======================================================================================

            eval_dict[q_id] = {
                'video_id': video_id,
                'question': question,
                'output_sequence': output
            }  
            print(f'q_id:{q_id}, output:{output}!\n')
        except Exception as e:
            traceback.print_exc()  
    # save this dataset's chat results to JSON
    eval_dataset_json = f'{args.chat_conversation_output_folder}/{dataset_name}_eval.json'
    with open(eval_dataset_json, 'w', encoding='utf-8') as f:
        json.dump(eval_dict, f, indent=2)
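
The ask(...) call above is model-specific and must be replaced with your own model's inference code. As a rough, hedged illustration of one common approach for frame-based Video-LLMs, the sketch below uniformly samples frames with OpenCV; sample_frames and the dummy return values are illustrative placeholders, not part of the Video-Bench toolkit.

import cv2  # assumes opencv-python is installed

def sample_frames(vid_path, num_frames=8):
    """Uniformly sample up to `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(vid_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    if total > 0:
        step = max((total - 1) / max(num_frames - 1, 1), 1)
        for i in range(num_frames):
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * step))
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

def ask(args, question, model, tokenizer, image_processor, vid_path):
    """Placeholder: returns (answer_text, option_scores) as the loop above expects."""
    frames = sample_frames(vid_path)
    # Replace the lines below with your model's actual call, e.g. preprocess `frames`
    # with `image_processor` and generate an answer conditioned on `question`.
    output, output_scores = "A", None  # dummy values so the surrounding loop runs end-to-end
    return output, output_scores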

After obtaining the ./Chat_results/{dataset_name}_eval.json files, you can use ChatGPT or a T5 model as an expert to assess the correctness of your model's answers. The specific commands are as follows:

Step 2: Evaluate your model's answers and obtain final scores across 13 datasets.

ChatGPT Evaluation: since ChatGPT may occasionally return answers with formatting errors, you may need to run Step2_chatgpt_judge.py below multiple times to ensure that every question is validated by ChatGPT.

# --apikey must be set to your own OpenAI API key
python Step2_chatgpt_judge.py --model_chat_files_folder ./Chat_results \
    --apikey sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
    --chatgpt_judge_output_folder ./ChatGPT_Judge

python Step3_merge_into_one_json.py --chatgpt_judge_files_folder ./ChatGPT_Judge \
    --merge_file ./Video_Bench_Input.json
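
The judging scripts themselves are not reproduced in this README. As a hedged sketch of the idea behind Step2_chatgpt_judge.py, the snippet below asks ChatGPT to map a model's free-form answer to a single option letter and retries when the reply is malformed (the reason the script may need several passes). It assumes the openai>=1.0 Python client; the prompt wording and judge model are illustrative, not the toolkit's actual choices.

from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI(api_key="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")  # your own API key

def judge_answer(question, model_output, letters=("A", "B", "C", "D"), max_retries=3):
    """Ask ChatGPT which option letter the model's free-form output corresponds to."""
    prompt = (
        f"Question with options:\n{question}\n\n"
        f"Model answer:\n{model_output}\n\n"
        f"Reply with exactly one letter from {', '.join(letters)} that best matches the model answer."
    )
    for _ in range(max_retries):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",  # illustrative choice of judge model
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        reply = resp.choices[0].message.content.strip().upper()
        if reply in letters:
            return reply
    return None  # still malformed after retries; re-run the judge for this question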

After you get the Video_Bench_Input.json file, you can submit it to the Video-Bench Leaderboard to compare your model with others!

🐳 License

Video-Bench is released under the Apache License, Version 2.0.

🀝 Contributors

binzhu-ece, linb203, munanning, yuanli2333


video-bench's Issues

Plans for releasing eval (scoring) code?

Hi.

I noticed that there is no code that gives me a score before submitting to the leaderboard.
I would like to check my scores before submitting, and if they are low, keep working on the model and resubmit.

Do you have plans for releasing the video_bench scoring code?

Cannot submit to the Video-Bench Leaderboard

Hi, I want to evaluate my model's results on the UCF-Crime dataset, and I made the JSON file Video-Bench-Input.json. When I press the "submit eval" button on the Hugging Face page of the Video-Bench Leaderboard, there seems to be no response, and after clicking 'Refresh' to reload the leaderboard, my upload does not appear. What is the problem? Please help me solve it, thanks a lot!

How to generate multi-choice questions from Basic QA datasets?

Hi, thanks for sharing the impressive benchmark.

I'm particularly interested in understanding more about the multi-choice questions used for assessing Basic QA ability. Could you kindly provide additional information on how they are obtained? (generated by ChatGPT or manually written?)
