
TinyLLaVABench


🎉 News

  • [2024.02.25] Update evaluation scripts and docs!
  • [2024.02.25] Data descriptions out. Release TinyLLaVA-1.5B and TinyLLaVA-2.0B!
  • [2024.02.24] Example code on inference and model loading added!
  • [2024.02.23] Evaluation code and scripts released!
  • [2024.02.21] Created the TinyLLaVABench repository on GitHub!
  • [2024.02.21] Our paper: TinyLLaVA: A Framework of Small-scale Large Multimodal Models is out!
  • [2024.01.11] Our first model, TinyLLaVA-1.4B, is out!

⌛ TODO

  • Add support for Ollama and llama.cpp.
  • Developers' guide / How to build demo locally.
  • Model Zoo descriptions.
  • Examples and inference.
  • Release code for training.
  • Add descriptions for evaluation.
  • Add descriptions for data preparation.
  • Release TinyLLaVA-1.5B and TinyLLaVA-2.0B.
  • Release TinyLLaVA-3.1B.
  • Release the evaluation code and weights today (2024.02.23).

🔥 High performance, but with fewer parameters

  • Our best model, TinyLLaVA-3.1B, achieves better overall performance than existing 7B models such as LLaVA-1.5 and Qwen-VL.

๐Ÿณ Model Zoo

Legacy Model

https://huggingface.co/bczhou/tiny-llava-v1-hf

Pretrained Model


Name            LLM              Checkpoint      LLaVA-Bench-Wild  MME     MMBench  MM-Vet  SQA-image  VQA-v2  GQA   TextVQA
TinyLLaVA-3.1B  Phi-2            TinyLLaVA-3.1B  75.8              1464.9  66.9     32.0    69.1       79.9    62.0  59.1
TinyLLaVA-2.0B  StableLM-2-1.6B  TinyLLaVA-2.0B  66.4              1433.8  63.3     32.6    64.7       78.9    61.9  56.4
TinyLLaVA-1.5B  TinyLlama        TinyLLaVA-1.5B  60.8              1276.5  55.2     25.8    60.3       76.9    60.3  51.7

🔧 Requirements and Installation

We recommend the following setup.

  1. Clone this repository and navigate to the TinyLLaVABench folder
git clone https://github.com/DLCV-BUAA/TinyLLaVABench.git
cd TinyLLaVABench
  2. Install the package
conda create -n tinyllava python=3.10 -y
conda activate tinyllava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
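
As a quick sanity check (not part of the official steps), the editable install should make the package importable from any directory:

# Minimal post-install check: these imports should succeed after `pip install -e .`.
from tinyllava.model.builder import load_pretrained_model
from tinyllava.mm_utils import get_model_name_from_path

print("tinyllava installed correctly")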

Upgrade to the latest code base

git pull
pip install -e .

# if you see some import errors when you upgrade, please try running the command below (without #)
# pip install flash-attn --no-build-isolation --no-cache-dir

🔧 Quick Start

Load model
from tinyllava.model.builder import load_pretrained_model
from tinyllava.mm_utils import get_model_name_from_path

model_path = "bczhou/TinyLLaVA-3.1B"

# Returns the tokenizer, the model, the image processor,
# and the maximum context length.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)
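
With the components loaded above, generation can also be run by hand. The sketch below assumes TinyLLaVABench keeps the upstream LLaVA helper API (conv_templates, process_images, tokenizer_image_token, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN); if these names differ in your checkout, use eval_model as shown in the next section.

from PIL import Image
import torch
from tinyllava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN  # assumed, as in upstream LLaVA
from tinyllava.conversation import conv_templates                       # assumed, as in upstream LLaVA
from tinyllava.mm_utils import process_images, tokenizer_image_token    # assumed, as in upstream LLaVA

# Preprocess a local image into the tensor format the vision tower expects.
image = Image.open("view.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config).to(model.device, dtype=torch.float16)

# Build a prompt with the "phi" conversation template used by TinyLLaVA-3.1B.
conv = conv_templates["phi"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nDescribe this image.")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Tokenize, splicing the image placeholder token into the sequence.
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(input_ids, images=image_tensor, do_sample=False, max_new_tokens=512)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True).strip())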

🔧 Run Inference

Run Inference
from tinyllava.model.builder import load_pretrained_model
from tinyllava.mm_utils import get_model_name_from_path
from tinyllava.eval.run_tiny_llava import eval_model

model_path = "bczhou/TinyLLaVA-3.1B"
prompt = "What are the things I should be cautious about when I visit here?"
image_file = "https://llava-vl.github.io/static/images/view.jpg"

# Pack the CLI-style arguments that eval_model expects into a lightweight object.
args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": "phi",  # conversation template matching the Phi-2 backbone
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,    # temperature 0 with a single beam gives greedy decoding
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512
})()

eval_model(args)
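
The type('Args', (), {...})() idiom builds a one-off object whose attributes eval_model reads like parsed CLI arguments. A purely stylistic alternative with the same effect is types.SimpleNamespace from the standard library:

from types import SimpleNamespace

# Equivalent to the type('Args', ...) construction above, just more readable.
args = SimpleNamespace(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
    query=prompt,
    conv_mode="phi",
    image_file=image_file,
    sep=",",
    temperature=0,
    top_p=None,
    num_beams=1,
    max_new_tokens=512,
)
eval_model(args)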

Evaluation

To ensure reproducibility, we evaluate the models with greedy decoding.

See Evaluation.md
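
In Hugging Face terms, greedy decoding means taking the argmax token at every step: no sampling and a single beam, i.e. temperature=0 with num_beams=1 in the inference arguments above. A minimal sketch, reusing the input_ids and image_tensor names from the Quick Start sketch:

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,  # disable sampling: argmax at each step
        num_beams=1,      # single beam, i.e. plain greedy search
        max_new_tokens=512,
    )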

Data Preparation

In our paper, we used and compared two different datasets: the LLaVA dataset and the ShareGPT4V dataset. This section provides information on data preparation.

Pretraining Images

  • LLaVA: The pretraining images of LLaVA are from the 558K subset of the LAION-CC-SBU dataset.
  • ShareGPT4V: The pretraining images of ShareGPT4V are a mixture of the 558K LAION-CC-SBU subset, the SAM dataset, and the COCO dataset.

Pretraining Annotations

  • LLaVA: The pretraining annotations of LLaVA are available from the LLaVA project.
  • ShareGPT4V: The pretraining annotations of ShareGPT4V are available from the ShareGPT4V project.

SFT Images & Annotations

The two SFT datasets are largely the same, except that the 23K detailed-description data in LLaVA-1.5-SFT is replaced with detailed captions randomly sampled from the 100K ShareGPT4V data.
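
The annotation files are plain JSON lists, so their composition is easy to inspect. A small sketch (the id/image/conversations field names follow the common LLaVA convention and may differ):

import json

# Load an SFT annotation file (see the layout under "Organize Data" below).
with open("text_files/llava_v1_5_mix665k.json") as f:
    samples = json.load(f)

print(f"{len(samples)} samples")
print(samples[0].keys())  # typically: id, image, conversations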

Download data

  1. Download the relevant images
  2. Download the relevant annotations

Organize Data

Organize the image files and annotation files as follows in path/to/your/data:

data
├── llava
│   ├── llava_pretrain
│   │   ├── images
│   │   ├── blip_laion_cc_sbu_558k.json
├── coco
│   ├── train2017
├── sam
│   ├── images
├── gqa
│   ├── images
├── ocr_vqa
│   ├── images
├── textvqa
│   ├── train_images
├── vg
│   ├── VG_100K
│   ├── VG_100K_2
├── share_textvqa
│   ├── images
├── web-celebrity
│   ├── images
├── web-landmark
│   ├── images
├── wikiart
│   ├── images
├── text_files
│   ├── llava_v1_5_mix665k.json
│   ├── share-captioner_coco_lcs_sam_1246k_1107.json
│   ├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
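
A small helper to verify that the expected directories and annotation files are in place before training (a convenience sketch; the path list mirrors the tree above):

import os

root = "path/to/your/data"  # replace with your actual data root
expected = [
    "llava/llava_pretrain/images",
    "llava/llava_pretrain/blip_laion_cc_sbu_558k.json",
    "coco/train2017",
    "sam/images",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "share_textvqa/images",
    "web-celebrity/images",
    "web-landmark/images",
    "wikiart/images",
    "text_files/llava_v1_5_mix665k.json",
    "text_files/share-captioner_coco_lcs_sam_1246k_1107.json",
    "text_files/sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json",
]

# Report anything missing so the gap can be fixed before launching a job.
missing = [p for p in expected if not os.path.exists(os.path.join(root, p))]
print("all present" if not missing else f"missing: {missing}")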

โœ Citation

If you find our paper and code useful in your research, please consider giving a star โญ and citation ๐Ÿ“.

@misc{zhou2024tinyllava,
      title={TinyLLaVA: A Framework of Small-scale Large Multimodal Models}, 
      author={Baichuan Zhou and Ying Hu and Xi Weng and Junlong Jia and Jie Luo and Xien Liu and Ji Wu and Lei Huang},
      year={2024},
      eprint={2402.14289},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

โค๏ธ Community efforts

  • Our codebase is built upon the LLaVA project. Great work!
  • Our project uses data from the ShareGPT4V project. Great work!
