
imp's Introduction

😈 Imp

[Technical report]  [Demo]  [Huggingface]

This repository contains the official training/evaluation code of the Imp project, which aims to provide a family of strong multimodal small language models (MSLMs). Our imp-v1-3b is a strong MSLM with only 3B parameters, which is built upon a small yet powerful SLM, Phi-2 (2.7B), and a powerful visual encoder, SigLIP (0.4B), and is trained on the LLaVA-v1.5 training set.

As shown in the Evaluation, imp-v1-3b significantly outperforms counterparts of similar model size, and even achieves slightly better performance than the strong LLaVA-7B model on various multimodal benchmarks.

We also release the model weights and a running example of Imp-v1-3B on Huggingface; the technical report is linked above. We will continue to improve our model and release new versions to further improve performance :)
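
For quick reference, below is a minimal inference sketch using the Hugging Face transformers interface with trust_remote_code. Treat it as an approximation rather than the authoritative snippet: the prompt template, the image_preprocess helper, and the images argument to generate come from the custom code bundled with the Huggingface checkpoint, so please consult the model card for the exact usage.

# Minimal inference sketch (approximate; see the Huggingface model card for the
# authoritative example).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MILVLG/imp-v1-3b"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

prompt = ("A chat between a curious user and an artificial intelligence assistant. "
          "USER: <image>\nWhat is in the image? ASSISTANT:")
image = Image.open("example.jpg")  # placeholder image path

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
image_tensor = model.image_preprocess(image)  # helper provided by the checkpoint's custom code (assumption)
output_ids = model.generate(input_ids, images=image_tensor,
                            max_new_tokens=100, use_cache=True)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())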

Updates

  • May 21, 2024: The technical report and corresponding Imp-v1.5-2B/3B/4B model series are released.
  • February 9, 2024: Training and evaluation code of the Imp-v1-3B model is released.

Table of Contents

  • Prerequisites
  • Model-zoo
  • Training
  • Evaluation
  • Deployment
  • License
  • About us
  • Citation

Prerequisites

  1. Clone this repository and navigate to the folder
git clone https://github.com/MILVLG/imp.git
cd imp
  2. Install packages

We recommend using Anaconda to create a new environment for the project, and install the requirements with the following commands:

conda create -n imp python=3.10 -y
conda activate imp
pip install -r requirements.txt
pip install flash-attn==2.4.2 --no-build-isolation
  3. Download the pretrained base models (i.e., Phi-2 and SigLIP) to your local directories. Note that the latest version of the Phi-2 model is not compatible with this repository. We strongly recommend using the following script to download the specific versions of the base models.
python scripts/download_models.py

The base models will be stored in checkpoints/base by default.

checkpoints
└── base
    β”œβ”€β”€ siglip-so400m-patch14-384
    └── phi-2
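
If you prefer to fetch the base models by hand, the following is a rough equivalent of the download step using huggingface_hub.snapshot_download. The revision pins below are placeholders: copy the exact revisions from scripts/download_models.py rather than pulling the latest Phi-2, which is not compatible with this codebase.

# Manual alternative to scripts/download_models.py (sketch).
from huggingface_hub import snapshot_download

PHI2_REVISION = None    # TODO: pin to the revision used by scripts/download_models.py
SIGLIP_REVISION = None  # TODO: likewise

snapshot_download("microsoft/phi-2", revision=PHI2_REVISION,
                  local_dir="checkpoints/base/phi-2")
snapshot_download("google/siglip-so400m-patch14-384", revision=SIGLIP_REVISION,
                  local_dir="checkpoints/base/siglip-so400m-patch14-384")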

Model-zoo

The checkpoints of different Imp models are provided in Model_Zoo.md.

Training

The training pipeline and datasets of imp-v1-3b are directly inherited from LLaVA-v1.5. The training consists of two stages:

  • Multimodal pretraining: train a projector on a subset of ∼558K image-text pairs to connect a frozen pretrained vision encoder and a frozen LLM.
  • Multimodal instruction tuning: fine-tune the projector and the LoRA adapters in the LLM with multimodal instruction data and VQA-formatted data to equip the MSLM with the ability to follow multimodal instructions.

Imp is trained on 8 A100 (40G) GPUs. You can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps to match your resources, but always keep the global batch size the same: global_batch_size = per_device_train_batch_size $\times$ gradient_accumulation_steps $\times$ num_gpus.
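
For example, the following two configurations (the target of 256 is only illustrative; use the value set in the provided training scripts) yield the same global batch size:

# Two equivalent resource configurations giving the same global batch size.
def global_batch_size(per_device, grad_accum, num_gpus):
    return per_device * grad_accum * num_gpus

assert global_batch_size(per_device=32, grad_accum=1, num_gpus=8) == 256  # 8x A100-40G
assert global_batch_size(per_device=8, grad_accum=4, num_gpus=8) == 256   # GPUs with less memory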

Training scripts

Stage-1: Multimodal pretraining

Please download the caption annotations blip_laion_cc_sbu_558k.json and the images from here. Move the downloaded files to the ./datasets folder, unzip the image archive, and rename the image folder to pretrain_images. Then run the following command to start the training process:

bash scripts/pretrain.sh

After that, a checkpoint file will be stored in ./checkpoints/imp-v1-3b-stage1.

Stage-2: Multimodal instruction tuning

Please download the annotation file of the mixed instruction tuning data, llava_v1_5_mix665k.json, and download the images from its constituent datasets: COCO, GQA, OCR-VQA, TextVQA, and Visual Genome.

After downloading all of them, organize the data as follows:

datasets
β”œβ”€β”€ llava_v1_5_mix665k.json
└── finetune_images
    β”œβ”€β”€ coco
    β”‚   └── train2017
    β”œβ”€β”€ gqa
    β”‚   └── images
    β”œβ”€β”€ ocr_vqa
    β”‚   └── images
    β”œβ”€β”€ textvqa
    β”‚   └── train_images
    └── vg
        β”œβ”€β”€ VG_100K
        └── VG_100K_2
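
Before launching stage-2 training, you can optionally sanity-check that the layout above is in place with a small script like the following (not part of the repository):

# Optional check that the stage-2 data layout described above exists.
from pathlib import Path

root = Path("datasets")
expected = [
    "llava_v1_5_mix665k.json",
    "finetune_images/coco/train2017",
    "finetune_images/gqa/images",
    "finetune_images/ocr_vqa/images",
    "finetune_images/textvqa/train_images",
    "finetune_images/vg/VG_100K",
    "finetune_images/vg/VG_100K_2",
]
missing = [p for p in expected if not (root / p).exists()]
print("All stage-2 data found." if not missing else f"Missing: {missing}")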

Then, you can start the training process with the following script. If you use your own custom dataset, refer to llava_v1_5_mix665k.json for how to format your data.

bash scripts/finetune_lora.sh
# bash scripts/finetune.sh  # full finetuning is not recommended

You will get a trained model imp-v1-3b-stage2-lora (a LoRA diff if you use finetune_lora.sh) under ./checkpoints/ when the training is done.

Submodel merging

After the above training, the resulting checkpoint consists of multiple sub-models. You can use the following script to merge the stage-2 sub-models into a single checkpoint for release. Our evaluation scripts support both the sub-model and merged checkpoints. However, if you want to fine-tune the model on your own custom dataset, only the merged model is supported.

bash scripts/merge.sh

After that, a checkpoint file will be stored in ./checkpoints/imp-v1-3b.

Finetuning on custom datasets

You can also finetune Imp on your own custom dataset using finetune_lora_custom.sh. The custom dataset should be in the LLaVA-1.5 format.

bash scripts/finetune_lora_custom.sh
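
For reference, the sketch below writes a minimal one-sample annotation file in the LLaVA-1.5 format (field names follow llava_v1_5_mix665k.json; the file and image paths are placeholders, not shipped with the repository):

# Minimal example of the LLaVA-1.5 annotation format for a custom dataset.
# Pass the resulting JSON via --data_path and the image directory via
# --image_folder in finetune_lora_custom.sh.
import json

sample = {
    "id": "example-0000",
    "image": "my_images/example.jpg",  # path relative to --image_folder
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in this picture?"},
        {"from": "gpt", "value": "A short ground-truth answer or description."},
    ],
}

with open("datasets/my_custom_data.json", "w") as f:
    json.dump([sample], f, indent=2)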

Evaluation

We follow the evaluation of LLaVA-v1.5 and conduct experiments on 9 commonly-used benchmarks, including 5 academic VQA benchmarks and 4 popular MLLM benchmarks. All evaluation scripts are placed in the scripts/eval folder.

Before preparing task-specific data, you should download eval.zip and unzip it to ./playground/data/eval. For more specific instructions, please refer to LLaVA's Evaluation.md.

You can evaluate either your reproduced model checkpoints or our released models. For more detailed evaluation scripts, please refer to Evaluation.md.

Using our provided model, you can reproduce the following results. Our imp-v1-3b model significantly outperforms existing MSLMs of similar model size, and is comparable with the strong LLaVA-v1.5-7B model.

| Models | VQAv2 | GQA | VizWiz | SQA(IMG) | TextVQA | POPE | MME(P) | MMB | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-v1.5-lora (7B) | 79.10 | 63.00 | 47.80 | 68.40 | 58.20 | 86.40 | 1476.9 | 66.10 | 30.2 |
| TinyGPT-V (3B) | - | 33.60 | 24.80 | - | - | - | - | - | - |
| LLaVA-Phi (3B) | 71.40 | - | 35.90 | 68.40 | 48.60 | 85.00 | 1335.1 | 59.80 | 28.9 |
| MobileVLM (3B) | - | 59.00 | - | 61.00 | 47.50 | 84.90 | 1288.9 | 59.60 | - |
| MC-LLaVA (3B) | 64.24 | 49.60 | 24.88 | - | 38.59 | 80.59 | - | - | - |
| Imp-v1 (3B, ours) | 79.45 | 58.55 | 50.09 | 69.96 | 59.38 | 88.02 | 1434.0 | 66.49 | 33.1 |

Deployment

Based on MLC-LLM, we provide a lightweight deployment solution so that Imp can run inference efficiently on mobile devices.

  • After 4-bit quantization, Imp takes up only about 1.9 GB of storage and is fully capable of running on mobile phones.
  • All Android devices are supported; iOS support will come soon.
  • Textual and visual modalities are supported.

More details can be found in MILVLG/mlc-imp.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

About us

This project is maintained by MILVLG@Hangzhou Dianzi University (HDU), led by Prof. Zhou Yu and Prof. Jun Yu, and is mainly developed by Zhenwei Shao and Xuecheng Ouyang. We hope our model may serve as a strong baseline to inspire future research on MSLMs, as well as their derivative applications on mobile devices and robots.

Citation

If you use our model or refer to our work in your studies, please cite:

@article{imp2024,
  title={Imp: Highly Capable Large Multimodal Models for Mobile Devices},
  author={Shao, Zhenwei and Yu, Zhou and Yu, Jun and Ouyang, Xuecheng and Zheng, Lihao and Gai, Zhenbiao and Wang, Mingyang and Ding, Jiajun},
  journal={arXiv preprint arXiv:2405.12107},
  year={2024}
}


imp's Issues

Model Evaluation

Hi,

I have trained imp with lora. However, it does not process the reference when I run the evaluation scripts.
The following is the output when I evaluate on POPE.
[Screenshot: 2024-02-17 12:17:51 PM]

The training environment

Hi,

I am trying to reproduce the training of Imp, but there are some problems with the training environment, as shown below.

[Screenshot: 2024-02-12 8:41:47 AM]

My transformers version is 4.31.0, as required.

Quantized model latency issue.

Hi, I am trying to see if I can load a quantized version of this model.

When I load in 4-bit, the model size is smaller but the latency significantly increases.

Not sure if any changes need to be made to support quantization.

Please, let me know.

I can also help by creating an MR to make the quantized model better.

Thanks

finetune_lora_custom.sh

I only changed

IMP_MODEL='./checkpoints/imp-v1-3b'

--data_path 
--image_folder 

but I get this information in my terminal:

You are using a model of type imp to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type imp to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type imp to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type imp to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
Downloading config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 576/576 [00:00<00:00, 1.89MB/s]
[2024-02-22 16:33:49,885] [WARNING] [partition_parameters.py:836:_post_init_method] param `probe` in SiglipMultiheadAttentionPoolingHead not on GPU so was not broadcasted from rank 0
[2024-02-22 16:33:53,686] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 7.77B parameters
Traceback (most recent call last):
  File "/data1/*** /imp/imp_llava/train/train_mem.py", line 15, in <module>
    train()
  File "/data1/***/imp/./imp_llava/train/train.py", line 827, in train
    model = LlavaLlamaForCausalLM.from_pretrained(
  File "/data1/***/site-packages/transformers/modeling_utils.py", line 2903, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/data1/***/site-packages/transformers/modeling_utils.py", line 3125, in _load_pretrained_model
    model.apply(model._initialize_weights)
  File "/data1/***/site-packages/torch/nn/modules/module.py", line 884, in apply
    module.apply(fn)
  File "/data1/***/site-packages/torch/nn/modules/module.py", line 884, in apply
    module.apply(fn)
  File "/data1/***/site-packages/torch/nn/modules/module.py", line 885, in apply
    fn(self)
  File "/data1/***/site-packages/transformers/modeling_utils.py", line 1261, in _initialize_weights
    self._init_weights(module)
  File "/data1/***/site-packages/transformers/models/llama/modeling_llama.py", line 472, in _init_weights
    module.weight.data[module.padding_idx].zero_()
IndexError: index 50256 is out of bounds for dimension 0 with size 0
[2024-02-22 16:33:55,511] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2275311
[2024-02-22 16:33:55,524] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2275312
[2024-02-22 16:33:55,535] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2275313
[2024-02-22 16:33:55,545] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2275314
