mmmu-benchmark / mmmu

This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"

Home Page: https://mmmu-benchmark.github.io/

License: Apache License 2.0

Python 100.00%
computer-vision deep-learning deep-neural-networks evaluation foundation-models large-language-models large-multimodal-models llm llms machine-learning

mmmu's Introduction

MMMU

๐ŸŒ Homepage | ๐Ÿค— Dataset | ๐Ÿค— Paper | ๐Ÿ“– arXiv | GitHub

This repo contains the evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"

🔔 News

  • 🚀 [2024-01-31]: We added Human Expert performance on the Leaderboard! 🌟
  • 🔥 [2023-12-04]: Our evaluation server for the test set is now available on EvalAI. We welcome all submissions and look forward to your participation! 😆

Introduction

We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 32 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. Our evaluation of 14 open-source LMMs and the proprietary GPT-4V(ision) highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V only achieves a 56% accuracy, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence (AGI).


Dataset Creation

MMMU was created to challenge multimodal models with tasks that demand college-level subject knowledge and deliberate reasoning, pushing the boundaries of what these models can achieve in terms of expert-level perception and reasoning. Please refer to our huggingface 🤗 Dataset for more details.

Evaluation

Please refer to our eval folder for more details.
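As a rough illustration of the kind of answer matching an evaluation like this performs for multiple-choice questions (a minimal sketch only; the actual scripts in the eval folder implement their own, more robust parsing), a model response can be reduced to an option letter before scoring:

import re

def extract_choice(response, num_options):
    """Pull an option letter (A, B, C, ...) out of a free-form model response.

    Illustrative sketch; not the repository's parsing logic.
    """
    valid_letters = "ABCDEFGHIJ"[:num_options]
    match = re.search(rf"\b([{valid_letters}])\b", response.strip().upper())
    return match.group(1) if match else None

print(extract_choice("The answer is (C).", 4))  # -> C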

๐Ÿ† Mini-Leaderboard

Model Val (900) Test (10.5K)
Expert (Best) 88.6 -
Expert (Medium) 82.6 -
Expert (Worst) 76.2 -
GPT-4o* 69.1 -
Gemini 1.5 Pro* 62.2 -
InternVL2-Pro* 62.0 55.7
Gemini 1.0 Ultra* 59.4 -
Claude 3 Opus* 59.4 -
GPT-4V(ision) (Playground) 56.8 55.7
Reka Core* 56.3 -
Gemini 1.5 Flash* 56.1 -
SenseChat-Vision-0423-Preview* 54.6 50.3
Reka Flash* 53.3 -
Claude 3 Sonnet* 53.1 -
HPT Pro* 52.0 -
VILA1.5* 51.9 46.9
Qwen-VL-MAX* 51.4 46.8
InternVL-Chat-V1.2* 51.6 46.2
Skywork-VL* 51.4 46.2
LLaVA-1.6-34B* 51.1 44.7
Claude 3 Haiku* 50.2 -
Adept Fuyu-Heavy* 48.3 -
Gemini 1.0 Pro* 47.9 -
Marco-VL-Plus* 46.2 44.3
Yi-VL-34B* 45.9 41.6
Qwen-VL-PLUS* 45.2 40.8
HPT Air* 44.0 -
Reka Edge* 42.8 -
Marco-VL* 41.2 40.4
OmniLMM-12B* 41.1 40.4
Bunny-8B* 43.3 39.0
Bunny-4B* 41.4 38.4
Weitu-VL-1.0-15B* - 38.4
InternLM-XComposer2-VL* 43.0 38.2
Yi-VL-6B* 39.1 37.8
InfiMM-Zephyr-7B* 39.4 35.5
InternVL-Chat-V1.1* 39.1 35.3
Math-LLaVA-13B* 38.3 34.6
SVIT* 38.0 34.1
MiniCPM-V* 37.2 34.1
MiniCPM-V-2* 37.1 -
Emu2-Chat* 36.3 34.1
BLIP-2 FLAN-T5-XXL 35.4 34.0
InstructBLIP-T5-XXL 35.7 33.8
LLaVA-1.5-13B 36.4 33.6
Bunny-3B* 38.2 33.0
Qwen-VL-7B-Chat 35.9 32.9
SPHINX* 32.9 32.9
mPLUG-OWL2* 32.7 32.1
BLIP-2 FLAN-T5-XL 34.4 31.0
InstructBLIP-T5-XL 32.9 30.6
Gemini Nano2* 32.6 -
CogVLM 32.1 30.1
Otter 32.2 29.1
LLaMA-Adapter2-7B 29.8 27.7
MiniGPT4-Vicuna-13B 26.8 27.6
Adept Fuyu-8B 27.9 27.4
Kosmos2 24.4 26.6
OpenFlamingo2-9B 28.7 26.3
Frequent Choice 22.1 23.9
Random Choice 26.8 25.8

*: results provided by the authors.

🎯 We have released a full suite comprising 150 development samples and 900 validation samples. However, the 10,500 test questions are available without their answers. Use the development set for few-shot/in-context learning, and the validation set for debugging models, selecting hyperparameters, and quick evaluations. The answers and explanations for the test set questions are withheld. You can submit your model's predictions for the test set on EvalAI.
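For example, a minimal loading sketch (assuming the Hugging Face Hub layout with one config per subject; "Accounting" is used here purely as an example):

from datasets import load_dataset

subset = load_dataset("MMMU/MMMU", "Accounting")
print(subset)                    # DatasetDict with dev / validation / test splits
few_shot_pool = subset["dev"]    # use these as in-context exemplars
val_set = subset["validation"]   # answer-keyed; for debugging and hyperparameter selection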

Disclaimers

The guidelines for the annotators emphasized strict compliance with copyright and licensing rules from the initial data source, specifically avoiding materials from websites that forbid copying and redistribution. Should you encounter any data samples potentially breaching the copyright or licensing regulations of any site, we encourage you to contact us. Upon verification, such samples will be promptly removed.

Contact

Citation

BibTeX:

@inproceedings{yue2023mmmu,
  title={MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI},
  author={Xiang Yue and Yuansheng Ni and Kai Zhang and Tianyu Zheng and Ruoqi Liu and Ge Zhang and Samuel Stevens and Dongfu Jiang and Weiming Ren and Yuxuan Sun and Cong Wei and Botao Yu and Ruibin Yuan and Renliang Sun and Ming Yin and Boyuan Zheng and Zhenzhu Yang and Yibo Liu and Wenhao Huang and Huan Sun and Yu Su and Wenhu Chen},
  booktitle={Proceedings of CVPR},
  year={2024},
}

mmmu's People

Contributors

drogozhang · nipelement · xiangyue9607



mmmu's Issues

How was "prompt engineering" performed?

Hi, great work! I'm not seeing any examples of how you convert the input documents to actual prompts for the model. In the paper, the only relevant snippet seems to be:

If models do not provide prompts for task types in MMMU, we conduct prompt engineering on the validation set and use the most effective prompt for the zero-shot setup in the main experiments.

Can you please provide examples of how you formatted the inputs to any of the models you evaluated this on? Thanks!

Note: I see that in #5 you clarify that you follow MMLU, but this seems to contradict the statement in the paper about prompt engineering on the validation set. Can you clarify that in particular?
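For readers with the same question, here is a minimal sketch of one plausible zero-shot multiple-choice format (the option lettering and instruction wording are assumptions for illustration, not the authors' exact prompt):

import string

def build_mc_prompt(question, options):
    """Assemble a zero-shot multiple-choice prompt; illustrative format only."""
    lettered = "\n".join(
        f"({letter}) {opt}" for letter, opt in zip(string.ascii_uppercase, options)
    )
    return (
        f"{question}\n{lettered}\n"
        "Answer with the option's letter from the given choices directly."
    )

print(build_mc_prompt("What is shown in <image 1>?", ["A chart", "A map", "A diagram"]))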

Error reports when loading the dataset

Hi, I am trying to load your dataset with the provided command:
from datasets import load_dataset
dataset = load_dataset("MMMU/MMMU")

However, an error is reported:

ExpectedMoreSplits: {'dev'}

Can you check it?
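One pattern that usually sidesteps split-resolution errors like this is to load the per-subject configs explicitly and combine them yourself (a sketch, assuming the Hub dataset exposes one config per subject):

from datasets import concatenate_datasets, get_dataset_config_names, load_dataset

configs = get_dataset_config_names("MMMU/MMMU")   # e.g. "Accounting", "Art", ...
full_val = concatenate_datasets(
    [load_dataset("MMMU/MMMU", cfg, split="validation") for cfg in configs]
)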

Model Evaluation

Thanks for your great work! Do you have any plans to release the evaluation code for LLaVA-v1.5? Looking forward to your reply.

Add validation set to EvalAI

Would it be possible to add MMMU validation to EvalAI?

It'd be great to be able to compare the numbers calculated on the validation set with the ones produced by EvalAI.

There's an error in one of the Correct Examples for genetics

Looking through the Correct Examples I came across the genetics example, and it is wrong.

[screenshot of the genetics Correct Example]

  • The numbers don't correspond to what's in the image.
  • The logic/reasoning shown is incorrect regardless, even if the numbers had been okay.

It would be interesting to know how you are vetting what is considered to be correct.

Prompts

Hi!
Am I right that the prompts for the models were like "Question <...>, Options: (A) <...>, (B) ..." for multiple-choice and just "Question: <...>" for short-answer? There aren't any defined option letters for multiple-choice in the dataset; do you just match each option to a letter in order?

process_single_sample function's question

I'm a little confused about this part of the code. The comment suggests that when an image appears in the options, the image field is set to None. But images always appear in the data's options as <image_x>, so <img='(.*?)'> can't match anything.

def parse_img_path(text):
    matches = re.findall("<img='(.*?)'>", text)
    return matches

def process_single_sample(data):
    question = data['question']
    o_imgs_paths = []
    for option in data['options']:
        current_o_imgs_paths = parse_img_path(option)
        for img_path in current_o_imgs_paths:
            o_imgs_paths.append(img_path)

    if len(o_imgs_paths) > 1:  # multiple images in options, used for random selection
        return {'id': data['id'], 'question': question, 'options': data['options'], 'answer': data['answer'],
                'image': None, 'question_type': data['question_type']}
    else:
        return {'id': data['id'], 'question': question, 'options': data['options'], 'answer': data['answer'],
                'image': data['image_1'], 'question_type': data['question_type']}
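To make the mismatch concrete, a small sketch (illustration only, not a proposed patch): option strings reference images with numbered placeholders, which a pattern for those tokens does match, while the <img='...'> pattern above never does.

import re

option = "Either <image 2> or <image_3>"
print(re.findall(r"<image[ _](\d+)>", option))   # ['2', '3']
print(re.findall(r"<img='(.*?)'>", option))      # []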

Link for the open source methods in Leaderboard

Hi, I see that the leaderboard has added some new methods; however, for some open-source methods, I cannot find either their code or their papers. Could you add links to these methods (e.g., Marco-VL, InfiMM-Zephyr) on the leaderboard?

Evaluation Prompt for mPLUG-Owl2

Could I possibly know the evaluation prompt for mPLUG-Owl2? It seems that the subpar performance might be a result of an inappropriate prompt.

For example, the prompt for short-answer generation would be:

<|image|>{QUESTION}\nAnswer the question using a single word or phrase.

For multiple-choice questions, the prompt would be:

<|image|>{QUESTION}\n{OPTIONS}\nAnswer with the option's letter from the given choices directly.

Question about "Text as Input"

Thank you for your valuable MMMU benchmark.
I have a question regarding your paper. You mentioned in the article that each data point contains at least one image. Then, how were the results for Llama2 7B, FLAN-T5-XXL, Vicuna-13B and GPT-4 Text without OCR or LLaVA Caption obtained?

RuntimeError: The size of tensor a (162) must match the size of tensor b (7) at non-singleton dimension 1

When I run run_llava.py, I get the following error:
Traceback (most recent call last):
  File "/home/MMMU-main/eval/run_llava.py", line 106, in <module>
    main()
  File "/home/MMMU-main/eval/run_llava.py", line 98, in main
    out_samples = run_model(args, samples, model, call_model_engine, tokenizer, processor)
  File "/home/MMMU-main/eval/run_llava.py", line 23, in run_model
    response = call_model_engine_fn(args, sample, model, tokenizer, processor)
  File "/home/MMMU-main/eval/utils/model_utils.py", line 57, in call_llava_engine_df
    n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
RuntimeError: The size of tensor a (162) must match the size of tensor b (7) at non-singleton dimension 1
What causes this?
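One possible cause, offered as a guess with a minimal sketch (assuming a transformers/LLaVA version whose generate() returns only the newly generated tokens rather than prompt plus continuation): input_ids then holds the 162 prompt tokens while output_ids holds only the 7 generated ones, so slicing output_ids[:, :input_token_len] compares tensors of different widths. A defensive decode might look like this; it is not the repository's fix.

def decode_generation(tokenizer, input_ids, output_ids):
    """Decode generate() output whether or not it still echoes the prompt tokens."""
    input_token_len = input_ids.shape[1]
    if output_ids.shape[1] > input_token_len:
        # Output still contains the prompt; strip it before decoding.
        output_ids = output_ids[:, input_token_len:]
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()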

PNG files are not converted to RGB

I used this benchmark to measure the performance of LLaVA.
The following warning message appeared:

/usr/local/lib/python3.10/dist-packages/PIL/Image.py:981: UserWarning: Palette images with Transparency expressed in bytes should be converted to RGBA images

I think the image files were not converted to RGB.
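A minimal preprocessing sketch (not the repository's code) that flattens palette/RGBA PNGs to plain RGB before they reach the model and avoids the Pillow warning:

from PIL import Image

def to_rgb(image):
    """Convert any PIL image to plain RGB, going through RGBA first so that
    palette transparency is composited onto a white background."""
    if image.mode in ("P", "LA", "RGBA"):
        image = image.convert("RGBA")
        background = Image.new("RGB", image.size, (255, 255, 255))
        background.paste(image, mask=image.split()[-1])  # use the alpha channel as a mask
        return background
    return image.convert("RGB")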

Request for answer_dict.json for test and dev

The answer_dict_val.jsonl file is very simple and helpful for evaluating the validation set of the dataset.

I am wondering if you have any plans to release the answer dicts for the test set and dev set as well, to simplify the evaluation? Thank you!

Why is every answer in Structural Engineering just "?"

I was browsing the test set with the Dataset Viewer on HuggingFace (Link) and noticed that, for the Structural Engineering subset of Architecture_and_Engineering, literally every single answer and explanation is equal to "?". Surely this is a bug?


Image and JSON dataset.

Thanks for your great work! I want to use your data to train a model, and I wonder how to transform the parquet dataset format into images and a JSON file. Thanks in advance!
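A sketch of one way to do this, under the assumption of the Hub layout (one config per subject, PIL images in columns image_1 .. image_7); adjust fields and paths as needed:

import json
from pathlib import Path

from datasets import load_dataset

out_dir = Path("mmmu_export")
out_dir.mkdir(exist_ok=True)
records = []

ds = load_dataset("MMMU/MMMU", "Accounting", split="validation")
for ex in ds:
    record = {key: ex[key] for key in ("id", "question", "options", "answer", "question_type")}
    record["images"] = []
    for i in range(1, 8):
        img = ex.get(f"image_{i}")
        if img is not None:
            img_path = out_dir / f"{ex['id']}_image_{i}.png"
            img.save(img_path)
            record["images"].append(str(img_path))
    records.append(record)

with open(out_dir / "Accounting_validation.json", "w") as f:
    json.dump(records, f, indent=2)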

No (supported) data files found in /MMMU/MMMU

Hi, I am trying to load your dataset with the provided command:

from datasets import load_dataset
sub_dataset = load_dataset("/data/datasets/MMMU", "Accounting")

However, an error is reported:

Exception has occurred: DataFilesNotFoundError
No (supported) data files found in /MMMU/MMMU
  File "/data/project/bunny/test_eval_mmmu.py", line 4, in <module>
    sub_dataset = load_dataset("/MMMU/MMMU", "Accounting")
datasets.exceptions.DataFilesNotFoundError: No (supported) data files found in /MMMU/MMMU

Can you check it?
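For anyone hitting the same error: it usually means the local path does not contain data files in a layout that datasets can resolve. A hedged sketch of two alternatives (whether a bare local snapshot resolves subject configs can depend on the datasets version):

from datasets import load_dataset
from huggingface_hub import snapshot_download

# 1) Load directly from the Hub by repo id (files are cached locally):
accounting = load_dataset("MMMU/MMMU", "Accounting")

# 2) Or download a full snapshot first and point load_dataset at that folder:
local_path = snapshot_download(repo_id="MMMU/MMMU", repo_type="dataset")
accounting_local = load_dataset(local_path, "Accounting")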

model evaluation

Thank you for your great evaluation. We have recently used a training strategy similar to LLaVA, which co-trains VQA and chat data, resulting in significant improvements. Could you re-evaluate our model? https://github.com/THUDM/CogVLM/

Reproducing LLaVA-1.5-13B

Hi, I'm trying to reproduce LLaVA-1.5-13B's results on your benchmark using prompts like:

"Question: <image 1> Baxter Company has a relevant range of production between 15,000 and 30,000 units. The following cost data represents average variable costs per unit for 25,000 units of production. If 30,000 units are produced, what are the per unit manufacturing overhead costs incurred?\n Option: (A) $6\n(B) $7\n(C) $8\n(D) $9\nAnswer:"

for multiple-choice (e.g validation_Accounting_1)

Or for open questions:

Question: Using a finite summation, compute the  initial deflection at midspan for the beam in  Figure P8.42. Given: E = 3000 kips/in.2 .  Use 3-ft segments. Assume I = 0.5IG. <image 1>\nAnswer:

(e.g validation_Architecture_and_Engineering_14)
But I'm getting only 32.5% on the validation split vs. the reported 36.4% (I tried only questions with a single input image: 856/900).
What could be the problem?

Originally posted by @teasgen in #5 (comment)

validation_Materials_25 answer seems wrong?

validation_Materials_25

Question:

The density and associated percent crystallinity for two polypropylene materials are as follows: <image 1> Compute the densities of totally crystalline.

Choices:

['0.917 g/cm^3', '0.841 g/cm^3', '0.946 g/cm^3', '0.841 g/cm^3']

Correct answer: B


First of all, answers B and D seem to be exactly the same; both are 0.841 g/cm^3. Meanwhile, according to the density formula, the density of the totally crystalline material should be around 0.943 g/cm^3 (my own calculation), so C seems to be the correct answer. Just wanted to report this.
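For readers checking the arithmetic, the standard two-phase density relation from introductory materials-science texts (quoted here as background, not taken from the dataset's own explanation) is

\%\,\text{crystallinity} \;=\; \frac{\rho_c\,(\rho_s - \rho_a)}{\rho_s\,(\rho_c - \rho_a)} \times 100

where \rho_s is the specimen density, \rho_a the fully amorphous density, and \rho_c the fully crystalline density. With two specimens of known density and crystallinity this gives two equations in the two unknowns \rho_a and \rho_c, which can be solved simultaneously for the fully crystalline density the question asks for.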

GPT4o

GPT-4o ("o" for "omni") is OpenAI's most advanced model. It is multimodal (accepting text or image inputs and outputting text), and it has the same high intelligence as GPT-4 Turbo but is much more efficient: it generates text 2x faster and is 50% cheaper. Additionally, GPT-4o has the best vision and performance across non-English languages of any of OpenAI's models. GPT-4o is available in the OpenAI API to paying customers. https://platform.openai.com/docs/models/gpt-4o

Mismatch of the data label in Eval code

Hi, I am checking your dataset, and the label of "validation_Literature_10" is different from the one in your eval code:

"ground_truth": "D"

where the GT is "D".

However In the dataset, this data point is:
{'id': 'validation_Literature_10', 'question': 'Refer to the description <image 1>, which term refers to the choices a writer makes, including the combination of distinctive features such as mood, voice, or word choice?', 'options': "['Theme', 'Setting', 'Style', 'Tone']", 'explanation': '', 'image_1': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=340x291 at 0x7FFE9C906620>, 'image_2': None, 'image_3': None, 'image_4': None, 'image_5': None, 'image_6': None, 'image_7': None, 'img_type': "['Icons and Symbols']", 'answer': 'C', 'topic_difficulty': 'Easy', 'question_type': 'multiple-choice', 'subfield': 'Comparative Literature'}
where the GT is "C", however.

Can you check it?

.tsv file

Hello, could you please tell me where I can find MMMU_DEV_VAL.tsv for VLMEvalKit? On your Hugging Face page, I only find .parquet files.
