CMMU

📖 Paper | 🤗 Dataset | GitHub

This repo contains the evaluation code for the paper CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning.

We have released the validation set of CMMU; you can download it from here. The test set will be hosted on the flageval platform, where users can test their models by uploading them.

Introduction

CMMU is a novel multi-modal benchmark designed to evaluate domain-specific knowledge across seven foundational subjects: math, biology, physics, chemistry, geography, politics, and history. It comprises 3603 questions, incorporating text and images, drawn from a range of Chinese exams. Spanning primary to high school levels, CMMU offers a thorough evaluation of model capabilities across different educational stages.

Evaluation Results

We have evaluated 10 models on CMMU so far. The results are shown in the following tables.

Evaluate fill-in-the-blank questions by GPT-4

| Model | Val Avg. | Test Avg. |
|---|---|---|
| InstructBLIP-13b | 0.39 | 0.48 |
| CogVLM-7b | 5.55 | 4.90 |
| ShareGPT4V-7b | 7.95 | 7.63 |
| mPLUG-Owl2-7b | 8.69 | 8.58 |
| LLava-1.5-13b | 11.36 | 11.96 |
| Qwen-VL-Chat-7b | 11.71 | 12.14 |
| Intern-XComposer-7b | 17.87 | 18.42 |
| Gemini-Pro | 21.58 | 22.50 |
| Qwen-VL-Plus | 27.51 | 27.73 |
| GPT-4V | 30.19 | 30.91 |

Evaluate fill-in-the-blank questions by rules

| Model | Val Avg. | Test Avg. |
|---|---|---|
| InstructBLIP-13b | 0.04 | 0.00 |
| CogVLM-7b | 4.02 | 3.73 |
| ShareGPT4V-7b | 5.85 | 5.07 |
| mPLUG-Owl2-7b | 7.08 | 6.85 |
| LLava-1.5-13b | 8.17 | 8.06 |
| Qwen-VL-Chat-7b | 7.73 | 8.28 |
| Intern-XComposer-7b | 16.95 | 16.30 |
| Gemini-Pro | 18.77 | 17.87 |
| Qwen-VL-Plus | 21.19 | 21.34 |
| GPT-4V | 24.73 | 25.23 |

How to use

Load dataset

from eval.cmmu_dataset import CmmuDataset

# CmmuDataset loads the *.jsonl files found in data_root.
# Replace the placeholder string with the path to your copy of the dataset.
dataset = CmmuDataset(data_root="your_path_to_cmmu_dataset")
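
As a rough illustration of what loading `*.jsonl` files from `data_root` involves, the sketch below reads every JSON Lines file in a directory into a single list of dictionaries. It is not the implementation of `CmmuDataset`: the helper name `load_jsonl_dir` is hypothetical, and it assumes each non-empty line holds one JSON object.

```python
import json
from pathlib import Path


def load_jsonl_dir(data_root):
    """Read every *.jsonl file under data_root into one list of dicts.

    Hypothetical helper for illustration; not part of the CMMU code base.
    """
    records = []
    # Sort the paths so the loading order is deterministic.
    for path in sorted(Path(data_root).glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    records.append(json.loads(line))
    return records
```

This can be handy for inspecting the raw records before handing the same directory to `CmmuDataset`.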

About fill-in-the-blank questions

For fill-in-the-blank questions, CmmuDataset generates one new question for each entry in sub_questions. For example:

The original question is:

{
    "type": "fill-in-the-blank",
    "question_info": "question", 
    "id": "subject_1234", 
    "sub_questions": ["sub_question_0", "sub_question_1"],
    "answer": ["answer_0", "answer_1"]
}

Converted questions are:

[
{
    "type": "fill-in-the-blank",
    "question_info": "question" + "sub_question_0", 
    "id": "subject_1234-0",
    "answer": "answer_0"
},
{
    "type": "fill-in-the-blank",
    "question_info": "question" + "sub_question_1", 
    "id": "subject_1234-1",
    "answer": "answer_1"
}
]
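
The conversion above can be sketched in a few lines of Python. This is only an illustration of the mapping shown in the example, not the code used inside `CmmuDataset`; the function name `split_sub_questions` is hypothetical.

```python
def split_sub_questions(question):
    """Expand one fill-in-the-blank question into one question per blank."""
    converted = []
    pairs = zip(question["sub_questions"], question["answer"])
    for index, (sub_question, answer) in enumerate(pairs):
        converted.append({
            "type": question["type"],
            # The shared stem is followed by the text of the sub-question.
            "question_info": question["question_info"] + sub_question,
            # The new id is the original id plus the sub-question index.
            "id": f"{question['id']}-{index}",
            "answer": answer,
        })
    return converted


original = {
    "type": "fill-in-the-blank",
    "question_info": "question",
    "id": "subject_1234",
    "sub_questions": ["sub_question_0", "sub_question_1"],
    "answer": ["answer_0", "answer_1"],
}
converted = split_sub_questions(original)
print([q["id"] for q in converted])  # ['subject_1234-0', 'subject_1234-1']
```

Predictions for these questions should therefore use the suffixed ids, such as `subject_1234-0`, rather than the original id.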

About ShiftCheck

The shift_check parameter is True by default. You can find more information about shift_check in our technical report.

With shift_check enabled, CmmuDataset generates k new questions per original question, with ids of the form {original_id}-k.
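
To give a feel for how shifted copies of a question might be produced, here is a minimal sketch. It rests on an assumption that is not stated in this README: that shift_check rotates the options of a multiple-choice question so the correct answer appears in different positions. The field names `options` and `answer` (an index into `options`) and the function `shifted_variants` are hypothetical; consult the technical report for the actual procedure.

```python
def shifted_variants(question, k):
    """Return k copies of a multiple-choice question with rotated options.

    Assumed (hypothetical) fields: "id", "options" (list of option texts)
    and "answer" (index of the correct option in "options").
    """
    options = question["options"]
    n = len(options)
    variants = []
    for shift in range(k):
        s = shift % n  # rotating by n positions is the same as not rotating
        rotated = options[s:] + options[:s]
        variants.append({
            **question,
            "id": f"{question['id']}-{shift}",
            "options": rotated,
            # The correct option moves s positions to the left, wrapping around.
            "answer": (question["answer"] - s) % n,
        })
    return variants
```

Under this assumption, each variant keeps the same correct option text, only at a different position.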

Evaluate

The output should be a list of JSON dictionaries. Each dictionary requires the following keys:

{
    "question_id": "question id",
    "answer": "answer"
}
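
A prediction file in this format can be written with the standard `json` module. The sketch below assumes your model's answers are held in a dictionary that maps question ids to answer strings; the helper name `save_predictions` is hypothetical.

```python
import json


def save_predictions(predictions, path):
    """Write {question_id: answer} pairs as a list of JSON dictionaries."""
    records = [
        {"question_id": question_id, "answer": answer}
        for question_id, answer in predictions.items()
    ]
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese answers human-readable in the file.
        json.dump(records, f, ensure_ascii=False, indent=2)
    return records
```

The resulting file is what gets passed to the `--result` argument of `eval/evaluate.py`.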

Here is a result generated by GPT-4.

The current code calls the GPT-4 API through AzureOpenAI. You may need to modify eval/chat_llm.py to create your own client. Before running the evaluation, set the required environment variables, such as AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT.
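
Before starting a long evaluation run, it can help to check that these variables are set. The following sketch uses only the standard library; the helper `missing_azure_vars` is hypothetical, and it checks only the two variables named above, so your client in eval/chat_llm.py may need others.

```python
import os

# Only the two variables named in this README; others may be required.
REQUIRED_VARS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")


def missing_azure_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_azure_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```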

Run

python eval/evaluate.py --result your_pred_file --data-root your_path_to_cmmu_dataset

NOTE: Fill-in-the-blank questions are evaluated with GPT-4 by default. If you do not have access to GPT-4, you can evaluate them with a rule-based method instead. Be aware that the results may differ from the official ones.

python eval/evaluate.py --result your_pred_file --data-root your_path_to_cmmu_dataset --gpt none

To evaluate only specific types of questions, use the --qtype parameter. For example:

python eval/evaluate.py --result example/gpt4v_results_val.json --data-root your_path_to_cmmu_dataset --qtype fbq mrq
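
If you run the evaluation from another Python script, the command lines above can be assembled programmatically. This sketch uses only the flags shown in this README (`--result`, `--data-root`, `--gpt none`, `--qtype`); the helper `build_eval_command` and its parameter names are hypothetical.

```python
import sys


def build_eval_command(result, data_root, use_gpt=True, qtypes=None):
    """Build the argument list for eval/evaluate.py."""
    cmd = [sys.executable, "eval/evaluate.py",
           "--result", result, "--data-root", data_root]
    if not use_gpt:
        # Fall back to rule-based scoring of fill-in-the-blank questions.
        cmd += ["--gpt", "none"]
    if qtypes:
        # Restrict the evaluation to the given question types.
        cmd += ["--qtype", *qtypes]
    return cmd


cmd = build_eval_command(
    "example/gpt4v_results_val.json",
    "your_path_to_cmmu_dataset",
    use_gpt=False,
    qtypes=["fbq", "mrq"],
)
print(" ".join(cmd[1:]))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.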

Detailed evaluation results are saved in *_result.json.

Citation

BibTeX:

@article{he2024cmmu,
  title={CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning},
  author={Zheqi He and Xinya Wu and Pengfei Zhou and Richeng Xuan and Guang Liu and Xi Yang and Qiannan Zhu and Hua Huang},
  journal={arXiv preprint arXiv:2401.14011},
  year={2024},
}

cmmu's Issues

evaluate.py argument error

Thanks for the great work!
When I try to run the evaluation code, I find that evaluate.py accepts 'data-root' as the CLI argument, while the argument name in README.md is 'data_root'. It might be better to fix this in the documentation.
