
mgm's Introduction

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

The framework supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B with image understanding, reasoning, and generation simultaneously. We build this repo based on LLaVA.

Release

  • [05/03] 🔥 We support LLaMA3-based models! Welcome to try them here.
  • [04/15] 🔥 The Hugging Face demo is available. It is a 13B-HD version; welcome to try it out.
  • [03/28] 🔥 Mini-Gemini is coming! We release the paper, demo, code, models, and data!

Contents

Demo

We provide some selected examples in this section. More examples can be found on our project page. Feel free to try our online demo!

Install

Please follow the instructions below to install the required packages.

NOTE: If you want to use the 2B version, please make sure to install the latest version of Transformers (>= 4.38.0); a quick version check is shown after the install steps.

  1. Clone this repository
git clone https://github.com/dvlab-research/MGM.git
  2. Install Package
conda create -n mgm python=3.10 -y
conda activate mgm
cd MGM
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training cases
pip install ninja
pip install flash-attn --no-build-isolation
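
After installation, you can verify that your Transformers version meets the 2B requirement. A minimal sketch in Python (assuming the packaging library is available, which it is wherever pip works):

import transformers
from packaging import version

# The 2B (Gemma-based) model needs Transformers >= 4.38.0.
assert version.parse(transformers.__version__) >= version.parse("4.38.0"), (
    f"Transformers {transformers.__version__} is too old for the 2B model"
)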

Model

The framework is conceptually simple: dual vision encoders provide low-resolution visual embeddings and high-resolution candidates; patch info mining is proposed to conduct patch-level mining between high-resolution regions and low-resolution visual queries; and the LLM is used to marry text with images for both comprehension and generation at the same time.
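
To make the data flow concrete, below is a minimal PyTorch sketch of the patch info mining idea: each low-resolution visual token acts as a query that attends over its corresponding high-resolution patch candidates, and the mined features are added back to the low-resolution tokens. This is a simplified illustration only, not the repo's actual implementation; the class, layer names, and dimensions are assumptions.

import torch
import torch.nn as nn

class PatchInfoMiningSketch(nn.Module):
    # Hypothetical module for intuition only; names and shapes are not the repo's.
    def __init__(self, dim, dim_hr):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)      # project low-resolution visual queries
        self.key_proj = nn.Linear(dim_hr, dim)     # project high-resolution candidate keys
        self.value_proj = nn.Linear(dim_hr, dim)   # project high-resolution candidate values

    def forward(self, lr_tokens, hr_patches):
        # lr_tokens:  (B, N, dim)        one token per low-resolution patch
        # hr_patches: (B, N, M, dim_hr)  M high-resolution candidates per low-resolution patch
        q = self.query_proj(lr_tokens)[:, :, None]               # (B, N, 1, dim)
        k = self.key_proj(hr_patches)                            # (B, N, M, dim)
        v = self.value_proj(hr_patches)                          # (B, N, M, dim)
        att = (q @ k.transpose(-1, -2)) / (k.shape[-1] ** 0.5)   # (B, N, 1, M)
        mined = (att.softmax(-1) @ v).squeeze(2)                 # (B, N, dim)
        return lr_tokens + mined                                 # enriched low-resolution tokens

# Example: 576 low-resolution tokens, each with 4 high-resolution candidates.
# mining = PatchInfoMiningSketch(dim=1024, dim_hr=1536)
# out = mining(torch.randn(2, 576, 1024), torch.randn(2, 576, 4, 1536))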

We provide all our fully finetuned models on Stage 1 and 2 data:

| Model | LR | HR | Base LLM | Vision Encoder | Finetuning Data | Finetuning schedule | Download |
|---|---|---|---|---|---|---|---|
| MGM-2B | 336 | 768 | Gemma-2B | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-7B | 336 | 768 | Vicuna-7B-v1.5 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-13B | 336 | 768 | Vicuna-13B-v1.5 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-8B | 336 | 768 | LLaMA-3-8B-Instruct | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-8x7B | 336 | 768 | Mixtral-8x7B-Instruct-v0.1 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-34B | 336 | 768 | Nous-Hermes-2-Yi-34B | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-7B-HD | 672 | 1536 | Vicuna-7B-v1.5 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-13B-HD | 672 | 1536 | Vicuna-13B-v1.5 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-8B-HD | 672 | 1536 | LLaMA-3-8B-Instruct | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-8x7B-HD | 672 | 1536 | Mixtral-8x7B-Instruct-v0.1 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-34B-HD | 672 | 1536 | Nous-Hermes-2-Yi-34B | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |

Here are the pretrained weights on Stage 1 data only:

| Model | LR | HR | Base LLM | Vision Encoder | Pretrain Data | Finetuning schedule | Download |
|---|---|---|---|---|---|---|---|
| MGM-2B | 336 | 768 | Gemma-2B | CLIP-L | MGM-Pretrain | 1e | ckpt |
| MGM-7B | 336 | 768 | Vicuna-7B-v1.5 | CLIP-L | MGM-Pretrain | 1e | ckpt |
| MGM-13B | 336 | 768 | Vicuna-13B-v1.5 | CLIP-L | MGM-Pretrain | 1e | ckpt |
| MGM-8x7B | 336 | 768 | Mixtral-8x7B-Instruct-v0.1 | CLIP-L | MGM-Pretrain | 1e | ckpt |
| MGM-34B | 336 | 768 | Nous-Hermes-2-Yi-34B | CLIP-L | MGM-Pretrain | 1e | ckpt |

Preparation

Dataset

We provide the processed data for model training. For model pretraining, please download the following training image-based data and organize them as follows:

-> means put the data in the local folder.

  • LLaVA Images -> data/MGM-Pretrain/images, data/MGM-Finetune/llava/LLaVA-Pretrain/images
  • ALLaVA Caption -> data/MGM-Pretrain/ALLaVA-4V

For model finetuning, please download the following instruction data and organize them as follows:

-> means put the data in the local folder.

For model evaluation, please follow this link for preparation. We use some extra benchmarks for evaluation; please download the following data and organize them as follows:

-> means put the data in the local folder.

  • MMMU -> data/MGM-Eval/MMMU
  • MMB -> data/MGM-Eval/MMB
  • MathVista -> data/MGM-Eval/MathVista

Please put the pretraining data, finetuning data, and evaluation data in the MGM-Pretrain, MGM-Finetune, and MGM-Eval subfolders following Structure.

For meta info, please download the following files and organize them as in Structure.

| Data file name | Size |
|---|---|
| mgm_pretrain.json | 1.68 G |
| mgm_instruction.json | 1.79 G |
| mgm_generation_pure_text.json | 0.04 G |

IMPORTANT: mgm_generation_pure_text.json is a generation-related subset. DO NOT merge it with mgm_instruction.json, as it is already included there. You may merge this file with your customized LLM/VLM SFT dataset to enable the reasoning-based generation ability.
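
As an illustration, a minimal sketch of merging the generation subset into a customized SFT dataset (assuming both files are JSON lists of samples in the same conversation format; the custom file names and paths are hypothetical):

import json

# Hypothetical paths: replace my_custom_sft.json with your own SFT data.
with open("mgm_generation_pure_text.json") as f:
    generation_subset = json.load(f)   # list of generation-related samples
with open("my_custom_sft.json") as f:
    custom_sft = json.load(f)          # your own SFT samples, same format

# Concatenate the two sample lists and write the merged training file.
with open("my_custom_sft_with_generation.json", "w") as f:
    json.dump(custom_sft + generation_subset, f)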

Pretrained Weights

We recommend downloading the pretrained weights from the following links: CLIP-Vit-L-336, OpenCLIP-ConvNeXt-L, Gemma-2b-it, Vicuna-7b-v1.5, Vicuna-13b-v1.5, Mixtral-8x7B-Instruct-v0.1, and Nous-Hermes-2-Yi-34B, and put them in model_zoo following Structure.
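
If you prefer to script the downloads, below is a minimal sketch using huggingface_hub; the Hub repo IDs are the usual upstream ones and should be double-checked, and gated models require accepting their licenses on the Hub first.

from huggingface_hub import snapshot_download

# Vision encoders (repo IDs assumed from the upstream model cards).
snapshot_download("openai/clip-vit-large-patch14-336",
                  local_dir="model_zoo/OpenAI/clip-vit-large-patch14-336")
snapshot_download("laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup",
                  local_dir="model_zoo/OpenAI/openclip-convnext-large-d-320-laion2B-s29B-b131K-ft-soup")

# Example LLM (Vicuna-7B); repeat for the other LLMs you plan to use.
snapshot_download("lmsys/vicuna-7b-v1.5",
                  local_dir="model_zoo/LLM/vicuna/7B-V1.5")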

Structure

The folder structure should be organized as follows before training.

MGM
├── mgm
├── scripts
├── work_dirs
│   ├── MGM
│   │   ├── MGM-2B
│   │   ├── ...
├── model_zoo
│   ├── LLM
│   │   ├── gemma
│   │   │   ├── gemma-2b-it
│   │   ├── vicuna
│   │   │   ├── 7B-V1.5
│   │   │   ├── 13B-V1.5
│   │   ├── llama-3
│   │   │   ├── Meta-Llama-3-8B-Instruct
│   │   │   ├── Meta-Llama-3-70B-Instruct
│   │   ├── mixtral
│   │   │   ├── Mixtral-8x7B-Instruct-v0.1
│   │   ├── Nous-Hermes-2-Yi-34B
│   ├── OpenAI
│   │   ├── clip-vit-large-patch14-336
│   │   ├── openclip-convnext-large-d-320-laion2B-s29B-b131K-ft-soup
├── data
│   ├── MGM-Pretrain
│   │   ├── mgm_pretrain.json
│   │   ├── images
│   │   ├── ALLaVA-4V
│   ├── MGM-Finetune
│   │   ├── mgm_instruction.json
│   │   ├── llava
│   │   ├── coco
│   │   ├── gqa
│   │   ├── ocr_vqa
│   │   ├── textvqa
│   │   ├── vg
│   │   ├── gpt4v-dataset
│   │   ├── sam
│   │   ├── share_textvqa
│   │   ├── wikiart
│   │   ├── web-celebrity
│   │   ├── web-landmark
│   │   ├── ALLaVA-4V
│   │   ├── docvqa
│   │   ├── chartqa
│   │   ├── dvqa
│   │   ├── ai2d
│   ├── MGM-Eval
│   │   ├── MMMU
│   │   ├── MMB
│   │   ├── MathVista
│   │   ├── ...

Train

The training process consists of two stages: (1) feature alignment stage: bridge the vision and language tokens; (2) instruction tuning stage: teach the model to follow multimodal instructions.

Our models are trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
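
For example, a quick sanity check of the global batch size (the numbers below are hypothetical; take the real values from the training scripts in scripts/):

# Reference setup: 8 GPUs.
per_device_train_batch_size = 16
gradient_accumulation_steps = 1
num_gpus = 8
global_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus  # 128

# The same global batch size on 4 GPUs: halve the GPU count, double the accumulation steps.
assert 16 * 2 * 4 == global_batch_size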

Please make sure you download and organize the data following Preparation before training.

NOTE: Please set hostfile for 2-machine training and hostfile_4 for 4-machine training.

If you want to train and finetune the framework, please run the following command for MGM-7B with image size 336:

bash scripts/llama/train/stage_1_2_full_v7b_336_hr_768.sh

or for MGM-13B with image size 336:

bash scripts/llama/train/stage_1_2_full_v13b_336_hr_768.sh

Because we reuse the pretrained projector weights from MGM-7B, you can directly train MGM-7B-HD with image size 672 in stage-2 instruction tuning:

bash scripts/llama/train/stage_2_full_v7b_672_hr_1536.sh

Please find more training scripts of gemma, llama, mixtral, and yi in scripts/.

Evaluation

We perform evaluation on several image-based benchmarks. Please download the evaluation data following Preparation and organize them as in Structure.

| Model | LLM | Res. | Link | TextVQA | MMB | MME | MM-Vet | MMMU_val | MMMU_test | MathVista |
|---|---|---|---|---|---|---|---|---|---|---|
| MGM-2B | Gemma-2B | 336 | ckpt | 56.2 | 59.8 | 1341/312 | 31.1 | 31.7 | 29.1 | 29.4 |
| MGM-7B | Vicuna-7B-v1.5 | 336 | ckpt | 65.2 | 69.3 | 1523/316 | 40.8 | 36.1 | 32.8 | 31.4 |
| MGM-13B | Vicuna-13B-v1.5 | 336 | ckpt | 65.9 | 68.5 | 1565/322 | 46.0 | 38.1 | 33.5 | 37.0 |
| MGM-8B | LLaMA-3-8B-Instruct | 336 | ckpt | 67.6 | 72.7 | 1606/341 | 47.3 | 38.2 | 36.3 | -- |
| MGM-8x7B | Mixtral-8x7B-Instruct-v0.1 | 336 | ckpt | 69.2 | 75.6 | 1639/379 | 45.8 | 41.8 | 37.1 | 41.8 |
| MGM-34B | Nous-Hermes-2-Yi-34B | 336 | ckpt | 70.1 | 79.6 | 1666/439 | 53.0 | 48.7 | 43.6 | 38.9 |
| MGM-7B-HD | Vicuna-7B-v1.5 | 672 | ckpt | 68.4 | 65.8 | 1546/319 | 41.3 | 36.8 | 32.9 | 32.2 |
| MGM-13B-HD | Vicuna-13B-v1.5 | 672 | ckpt | 70.2 | 68.6 | 1597/320 | 50.5 | 37.3 | 35.1 | 37.0 |
| MGM-8B-HD | LLaMA-3-8B-Instruct | 672 | ckpt | 71.6 | -- | 1532/357 | -- | 37.0 | -- | -- |
| MGM-8x7B-HD | Mixtral-8x7B-Instruct-v0.1 | 672 | ckpt | 71.9 | 74.7 | 1633/356 | 53.5 | 40.0 | 37.0 | 43.1 |
| MGM-34B-HD | Nous-Hermes-2-Yi-34B | 672 | ckpt | 74.1 | 80.6 | 1659/482 | 59.3 | 48.0 | 44.9 | 43.3 |

If you want to evaluate the model on image-based benchmarks, please use the scripts in scripts/MODEL_PATH/eval. For example, run the following command for TextVQA evaluation with MGM-7B-HD:

bash scripts/llama/eval/textvqa.sh

Please find more evaluation scripts in scripts/MODEL_PATH.

CLI Inference

Chat with images without the need for a Gradio interface. Multiple GPUs and 4-bit and 8-bit quantized inference are also supported. Please make sure you have installed diffusers and PaddleOCR (the latter only for a better OCR experience), and try this for image understanding and generation inference:

python -m mgm.serve.cli \
    --model-path work_dirs/MGM/MGM-13B-HD \
    --image-file <path to your image>

or try this better experience with OCR (make sure you have installed PaddleOCR):

python -m mgm.serve.cli \
    --model-path work_dirs/MGM/MGM-13B-HD \
    --image-file <path to your image> \
    --ocr

or try this for inference with generation (make sure you have installed diffusers):

python -m mgm.serve.cli \
    --model-path work_dirs/MGM/MGM-13B-HD \
    --image-file <path to your image> \
    --gen

You can also try 8-bit or even 4-bit quantization for efficient inference:

python -m mgm.serve.cli \
    --model-path work_dirs/MGM/MGM-13B-HD \
    --image-file <path to your image> \
    --gen \
    --load-8bit

Gradio Web UI

Here, we adopt the Gradio UI similar to that in LLaVA to provide a user-friendly interface for our models. To launch a Gradio demo locally, please run the following commands one by one. If you plan to launch multiple model workers to compare between different checkpoints, you only need to launch the controller and the web server ONCE.

Launch a controller

python -m mgm.serve.controller --host 0.0.0.0 --port 10000

Launch a Gradio web server

python -m mgm.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload

You just launched the Gradio web interface. Now, you can open the web interface with the URL printed on the screen. You may notice that there is no model in the model list. Do not worry, as we have not launched any model worker yet. It will be automatically updated when you launch a model worker.

Launch a model worker

This is the actual worker that performs the inference on the GPU. Each worker is responsible for a single model specified in --model-path.

python -m mgm.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path work_dirs/MGM/MGM-13B-HD

Wait until the process finishes loading the model and you see "Uvicorn running on ...". Now, refresh your Gradio web UI, and you will see the model you just launched in the model list.

You can launch as many workers as you want, and compare between different models in the same Gradio interface. Please keep the --controller the same, and modify the --port and --worker to a different port number for each worker.

python -m mgm.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port <different from 40000, say 40001> --worker http://localhost:<change accordingly, i.e. 40001> --model-path work_dirs/MGM/MGM-34B-HD

If you are using an Apple device with an M1 or M2 chip, you can specify the mps device by using the --device flag: --device mps.

Launch a model worker (Multiple GPUs, when GPU VRAM <= 24GB)

If the VRAM of your GPU is less than 24GB (e.g., RTX 3090, RTX 4090, etc.), you may try running it with multiple GPUs. Our latest code base will automatically try to use multiple GPUs if you have more than one GPU. You can specify which GPUs to use with CUDA_VISIBLE_DEVICES. Below is an example of running with the first two GPUs.

CUDA_VISIBLE_DEVICES=0,1 python -m mgm.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path work_dirs/MGM/MGM-13B-HD

Launch a model worker (4-bit, 8-bit inference, quantized)

You can launch the model worker with quantized bits (4-bit, 8-bit), which allows you to run the inference with reduced GPU memory footprint. Note that inference with quantized bits may not be as accurate as the full-precision model. Simply append --load-4bit or --load-8bit to the model worker command that you are executing. Below is an example of running with 4-bit quantization.

python -m mgm.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path work_dirs/MGM/MGM-13B-HD --load-4bit

Examples

We provide some examples in this section. More examples can be found on our project page.

Hi-Resolution Understanding

Generation with Reasoning

Citation

If you find this repo useful for your research, please consider citing the paper

@article{li2024mgm,
  title={Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models},
  author={Li, Yanwei and Zhang, Yuechen and Wang, Chengyao and Zhong, Zhisheng and Chen, Yixin and Chu, Ruihang and Liu, Shaoteng and Jia, Jiaya},
  journal={arXiv:2403.18814},
  year={2024}
}

Acknowledgement

This project is not affiliated with Google LLC.

We would like to thank the following repos for their great work:

License

Code License | Data License | Weight License

The data and checkpoints are intended for and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaVA, LLaMA, Vicuna, and GPT-4. The dataset is CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

mgm's People

Contributors

eltociear, julianjuaner, lightmatmul, wcy1122, yanwei-li


mgm's Issues

Not find vision tower: model_zoo/OpenAI/clip-vit-large-patch14-336

Hi, nice work! But when I run the example code I encountered a problem: when I run python -m minigemini.serve.cli --model-path YanweiLi/Mini-Gemini-13B-HD --image-file Woman.jpg, I get the error ValueError: Not find vision tower: model_zoo/OpenAI/clip-vit-large-patch14-336. How can I solve this problem? Thank you!

Possible positional embedding in Patch Info Mining process

Thanks for sharing your great work. I have a small question about the attention process in the patch info mining module. I found that there is no positional embedding for the high-resolution tokens to indicate their position in the corresponding patch. Do you think adding positional embedding here would be helpful and have you tried this?

Question about the number of image tokens

For the llama2-7b model, the max token length is 2048. With the stage-2 training settings, once IMAGE_GRID=2 and IMAGE_GLOBAL=True are set, the two lines
image_features = torch.cat([image_feat_global, image_features], dim=1)
image_aux_features = torch.cat([image_aux_feat_global, image_aux_features], dim=1)
end up producing 2880 image feature tokens. Isn't this over the limit? Please correct me if I have misunderstood.

Can you provide a zip of the laion-gpt4v dataset images?

Hi, after downloading the laion-gpt4v images I got only 11686 images. I am using the JSON index order as the image name to avoid mismatches between the datasets and possibly wrong annotation-to-image mapping. Just to be sure, is the last image index the one shown in the attached screenshot?

Why doesn't any stage optimize the vision encoder?

Nice work. From the scripts provided, it seems that neither optimize_vision_tower nor its aux counterpart is used.

Then I have some questions:

  1. Does optimizing the vision tower give worse results?
  2. If the vision tower is not finetuned, how can it learn new visual tokens?

Question about the code implementation

Hello authors, thank you very much for open-sourcing this work. I have two questions I would like to ask:

1. ConvNeXt drop path

The ConvNeXt drop path is 0.1. Although it is set to be frozen during training, the drop path is still executed. In principle, shouldn't ConvNeXt be set to eval mode during training? I could not find any related code, which seems a bit strange, and I would like to understand this.

2. Code robustness

Training with the clip+convnext combination works without any problems. But I wanted to try SigLIP, and found that after merely swapping CLIP for SigLIP (each encoder using its own mean and std), the model produces NaN at iter=2. See my comments below:

        # token attention
        embed_query = self.vlm_uni_query_projector(images)
        embed_aux = self.vlm_uni_aux_projector(images_aux)
        embed_value = self.vlm_uni_val_projector(images_aux)
        # TODO siglip+convnext is fine after the first forward, but embed_att becomes nan
        # TODO which causes embed_value to become nan at the second iteration, so training fails
        # TODO I suspect a feature mismatch; even converting everything to fp32 still produces nan, needs further investigation
        embed_att = embed_query[:, :, None] @ (embed_aux.transpose(-1, -2) / (embed_aux.shape[-1] ** 0.5))
        # print('=xxxx=', torch.any(torch.isnan(embed_query)).item(),
        #       torch.any(torch.isnan(embed_aux)).item(),
        #       torch.any(torch.isnan(embed_value)).item(),
        #       torch.any(torch.isnan(embed_att)).item())
        embed_att = embed_att.nan_to_num()
        embed_feat = (embed_att.softmax(-1) @ embed_value).mean(2)
        # print('=xxcccxx=', torch.any(torch.isnan(embed_feat)).item())
        image_features = images + embed_feat
        return image_features

ไธ็Ÿฅ้“ไฝœ่€…ๅฏนไบŽ่ฟ™ไธช้—ฎ้ข˜ๆ˜ฏๅ’‹็œ‹็š„ใ€‚ๆœ‰ไธ€ไธช็ป†่Š‚ไธไธ€ๆ ท๏ผšๅ› ไธบ siglip ่พ“ๅ…ฅๆ˜ฏ 384x384 ็š„๏ผŒ่พ“ๅ‡บๅฐบๅบฆๆ˜ฏ 27x27๏ผŒๆ‰€ไปฅๆˆ‘ๅฟ…้กป่ฆๅฐ† convnext ่พ“ๅ…ฅๅˆ†่พจ็Ž‡่ฎพ็ฝฎไธบ 864, ่ฟ™ๆ ทๆ‰ๅฏไปฅไฟ่ฏไธค่€…็ฉบ้—ดๅฎŒๅ…จๅฏน้ฝใ€‚

ๆœŸๅพ…ๆ‚จ็š„ๅ›žๅคใ€‚

minigemini_instruction.json includes pretraining LLaVA images, but the README does not mention that finetuning uses this data

In minigemini_instruction.json there are entries whose image is llava/LLaVA-Pretrain/images/00013/000133305.jpg, and there are many LLaVA images, but the README does not mention that LLaVA images are needed for finetuning. Is this an oversight?

images_aux encode error

When using clip.py for inference, I encountered the following error. How should I solve it?

File "/mnt/bn/codegen-finetune/projects/MiniGemini/minigemini/model/mini_gemini_arch.py", line 255, in encode_images
if images_aux is not None:
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/mnt/bn/codegen-finetune/projects/MiniGemini/minigemini/model/multimodal_encoder/clip_encoder.py", line 58, in forward
image_features = self.image_forward(images)
File "/mnt/bn/codegen-finetune/projects/MiniGemini/minigemini/model/multimodal_encoder/clip_encoder.py", line 50, in image_forward
image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 917, in forward
return self.vision_model(
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 841, in forward
hidden_states = self.embeddings(pixel_values)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 187, in forward
embeddings = embeddings + self.position_embedding(self.position_ids)
RuntimeError: The size of tensor a (2305) must match the size of tensor b (50) at non-singleton dimension 1

batch inference

Hi, could you tell me how I can do batch inference?
I have multiple images and a different prompt for each image, so is there a way I can get the output in one go?

Questions about how to enlarge the base vision tower input resolution

Currently there is an option to use a two-image grid to double the input, but this introduces rather heavy compute.

I just want to make the input resolution slightly larger, say from 336 to 448, while keeping the ConvNeXt input resolution the same (although I think it currently needs to be larger if the base vision tower is larger).

Could that be possible? Can you give me some advice on how to adapt it?

AutoTokenizer resolve error

Simply running the demo code as follows fails:

python -m minigemini.serve.cli \
    --model-path YanweiLi/Mini-Gemini-2B \
    --image-file ./images/demo_gen.png

error message:

  File "XXXXX/envs/minigemini/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 784, in from_pretrained
    raise ValueError(
ValueError: Tokenizer class GemmaTokenizer does not exist or is not currently imported.

Updating to the newest transformers fixes this problem.

batching giving weird outputs

Hi, I noticed that when doing batch inference with a static prompt like 'Describe the image', the model gives wrong output like 'in detail', as if it is just doing sentence completion. Whereas if I try a more descriptive prompt, where I tell Mini-Gemini that it is a 'prompt generator', it goes into sentence-completion mode and gives me an okayish response.

However, I also have the original image descriptions, so I tried adding those to the prompt and then asking the model to describe the image, given the information about it. This works perfectly fine when I use just one image at a time.

But when doing batch processing I get completely garbage outputs. For batch processing, I pad the prompts to the same shape. I do this by changing line 44 in MiniGemini/minigemini/mm_utils.py
to
tokenizer(chunk, padding='max_length', max_length=max_len).input_ids for chunk in prompt.split('')

Could you give me any advice on how to do this effectively?

performing finetune on top of mini-gemini-8x7b-HD

When performing finetuning on top of the minigemini-8x7b-HD model using the following config:

#!/bin/bash
PRETRAIN_NAME=Mini-Gemini-8x7B-Pretrain
FINETUNE_NAME=Mini-Gemini-8x7B-HD
AUX_SIZE=1536
IMAGE_GRID=2
IMAGE_GLOBAL=True
LR_MULTI="model.mm_projector:2,model.vlm_uni:2"

# delete --hostfile hostfile_4 and change --per_device_train_batch_size if trained on single machine

deepspeed minigemini/train/train_mem.py \
    --deepspeed ./scripts/zero3_offload.json \
    --model_name_or_path Mini-Gemini-8x7B-HD \
    --version mistral_instruct \
    --data_path data/minigemini_instruction.json \
    --image_folder data/figures \
    --vision_tower model_zoo/OpenAI/clip-vit-large-patch14-336 \
    --vision_tower_aux model_zoo/OpenAI/openclip-convnext-large-d-320-laion2B-s29B-b131K-ft-soup \
    --image_grid $IMAGE_GRID \
    --image_global $IMAGE_GLOBAL \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --image_size_aux $AUX_SIZE \
    --bf16 True \
    --output_dir ./work_dirs/experiment-2/$FINETUNE_NAME \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --save_strategy "steps" \
    --save_steps 2 \
    --save_total_limit 1 \
    --learning_rate 8e-4 \
    --lr_multi $LR_MULTI \
    --weight_decay 0. \
    --warmup_ratio 0.05 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 8192 \
    --gradient_checkpointing True \
    --dataloader_num_workers 64 \
    --lazy_preprocess True \
    --report_to wandb

The final model only has 4 safetensors files instead of 20. Why is that? Thanks.

evaluate MMMU_val of 2B error

Hi, I tried evaluating MMMU_val of the 2B model by running this command

 bash /home/user/minigemini/scripts/gemma/eval/mmmu.sh

but got the error log below, while the MMMU_val evaluation of the 7B, 13B, and 34B models runs successfully.

Traceback (most recent call last):
  File "/home/user/MiniGemini/minigemini/eval/MMMU/eval/run_llava.py", line 207, in <module>
    main()                               
  File "/home/user/MiniGemini/minigemini/eval/MMMU/eval/run_llava.py", line 158, in main
    response = call_model_engine(args, sample, model, tokenizer, processor)
  File "/home/user/MiniGemini/minigemini/eval/MMMU/eval/utils/model_utils_ind.py", line 53, in call_llava_engine_df
    output_ids = model.generate(
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/MiniGemini/minigemini/model/language_model/mini_gemini_gemma.py", line 144, in generate
    return super().generate(
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/transformers/generation/utils.py", line 1648, in generate
    result = self._beam_sample(
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/transformers/generation/utils.py", line 3402, in _beam_sample
    outputs = self(                      
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/MiniGemini/minigemini/model/language_model/mini_gemini_gemma.py", line 97, in forward
    return super().forward(
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py", line 1105, in forward
    outputs = self.model(
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py", line 891, in forward
    causal_mask = self._update_causal_mask(attention_mask, inputs_embeds, cache_position)
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py", line 984, in _update_causal_mask
    causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
RuntimeError: The size of tensor a (701) must match the size of tensor b (0) at non-singleton dimension 0
scripts/gemma/eval/mmmu.sh: line 31: MMMU/answers/Mini-Gemini-2B/merge.jsonl: No such file or directory
scripts/gemma/eval/mmmu.sh: line 35: MMMU/answers/Mini-Gemini-2B/merge.jsonl: No such file or directory
Traceback (most recent call last):
  File "/home/user/MiniGemini/minigemini/eval/MMMU/eval/eval.py", line 31, in <module>
    main()                               
  File "/home/user/MiniGemini/minigemini/eval/MMMU/eval/eval.py", line 17, in main
    out_samples = [json.loads(line) for line in open(args.result_file)]
FileNotFoundError: [Errno 2] No such file or directory: 'MMMU/answers/Mini-Gemini-2B/merge.jsonl'

Begin training loss a little high

Hi, the initial loss is a little high. Is this normal?

{'loss': 6.7177, 'learning_rate': 0.0, 'epoch': 0.0}                                                                                                                                                                             
{'loss': 6.7738, 'learning_rate': 0.0, 'epoch': 0.0}                                                                                                                                                                             
{'loss': 6.801, 'learning_rate': 0.0, 'epoch': 0.0}                                                                                                                                                                              
{'loss': 6.5226, 'learning_rate': 0.0, 'epoch': 0.0}                                                                                                                                                                             
{'loss': 6.813, 'learning_rate': 7.633587786259541e-06, 'epoch': 0.0}                                                                                                                                                            
{'loss': 6.6355, 'learning_rate': 1.5267175572519083e-05, 'epoch': 0.0}                                                                                                                                                          
{'loss': 6.3212, 'learning_rate': 2.2900763358778628e-05, 'epoch': 0.0}                                                                                                                                                          
{'loss': 6.1449, 'learning_rate': 3.0534351145038166e-05, 'epoch': 0.0}                                                                                                                                                          
{'loss': 5.8368, 'learning_rate': 3.816793893129771e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 5.9083, 'learning_rate': 4.5801526717557256e-05, 'epoch': 0.0}                                                                                                                                                          
{'loss': 5.7246, 'learning_rate': 5.3435114503816794e-05, 'epoch': 0.0}                                                                                                                                                          
{'loss': 5.6311, 'learning_rate': 6.106870229007633e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 5.7906, 'learning_rate': 6.870229007633588e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 5.6311, 'learning_rate': 7.633587786259542e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 5.1357, 'learning_rate': 8.396946564885496e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 4.9726, 'learning_rate': 9.160305343511451e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 4.7747, 'learning_rate': 9.923664122137405e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 4.8886, 'learning_rate': 0.00010687022900763359, 'epoch': 0.0}                                                                                                                                                          
{'loss': 4.7224, 'learning_rate': 0.00011450381679389313, 'epoch': 0.0}                                                                                                                                                          
{'loss': 4.4193, 'learning_rate': 0.00012213740458015266, 'epoch': 0.0}                                                                                                                                                          
{'loss': 4.5041, 'learning_rate': 0.00012977099236641222, 'epoch': 0.0}                                                                                                                                                          
{'loss': 4.3678, 'learning_rate': 0.00013740458015267177, 'epoch': 0.01}                                                                                                                                                         
{'loss': 4.3701, 'learning_rate': 0.0001450381679389313, 'epoch': 0.01}                                                                                                                                                          
{'loss': 4.505, 'learning_rate': 0.00015267175572519084, 'epoch': 0.01}                                                                                                                                                          
{'loss': 4.5022, 'learning_rate': 0.00016030534351145037, 'epoch': 0.01}                                                                                                                                                         
{'loss': 4.2254, 'learning_rate': 0.00016793893129770992, 'epoch': 0.01}                                                                                                                                                         
{'loss': 4.339, 'learning_rate': 0.00017557251908396944, 'epoch': 0.01}                                                                                                                                                          
{'loss': 4.5537, 'learning_rate': 0.00018320610687022902, 'epoch': 0.01}                                                                                                                                                         
{'loss': 4.2307, 'learning_rate': 0.00019083969465648857, 'epoch': 0.01}   

How does the loss curve look on your side? (I am using a 4B LLM.)

size mismatch about convnext_large_d_320

[error screenshot]

Hello, I downloaded the pretrained models based on the README and placed them in the corresponding locations. At which step did I go wrong? Please help.

Pretrain data not found in AllaVA

Hi, the pretraining data uses ALLaVA images from both the LAION and Vflan parts.

But the LAION-part image names are totally different from ALLaVA's image naming format.

I tried to find:

465440.jpeg
320609

They are both used in minigemini_pretrain.json but cannot be found in the ALLaVA images folder.

ls -f images | grep 465440
46544031.jpeg
(base) โžœ  allava_laion git:(main) โœ— ls -f images | grep 320609     
132060956.jpeg
43206091.jpeg

Why is that?

Fixing the pretrain and SFT stage ALLaVA images issue

Hi, as a previous issue raised, the data is wrong for anything taken from ALLaVA, because they have changed the image names.

What is even worse, even when mapping the image names to the latest ALLaVA caption JSON via their URLs, some images still cannot be found at all in the latest ALLaVA.

for example:

cat ALLaVA-Caption-LAION-4V.json | grep 'https://slideplayer.it/slide/553401/1/images/40/Delayed+relaxation+filling+pattern.jpg' -C 9

This URL exists in the Mini-Gemini data but is gone from the latest ALLaVA images.

Therefore, please help fix the data first; it is urgent and blocks anyone who wants to reproduce the Mini-Gemini training and scores.

How many SAM images were used from ShareGPT4v?

I downloaded the ShareGPT4V finetuning-part data, but I always get image-not-found errors in the finetune stage.

Does finetuning use the ShareGPT4V pretraining data?

The ShareGPT4V finetuning set only uses very little data from SAM.

Shall we download the whole 500GB of images (sam_000000 - 0000050) for it?

OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory model_zoo/OpenAI/clip-vit-large-patch14-336.

Hello everyone, I am running MiniGemini evaluation on an image by typing the command:

python -m minigemini.serve.cli  --model-path ./Mini-Gemini-2B/     --image-file replaced_with_path_to_image

then the following OSError emerged:

Traceback (most recent call last):
  File "home_path/anaconda3/envs/minigeimini/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "home_path/anaconda3/envs/minigeimini/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data1/code_path/Projects/github/MiniGemini/minigemini/serve/cli.py", line 237, in <module>
    main(args)
  File "/data1/code_path/Projects/github/MiniGemini/minigemini/serve/cli.py", line 56, in main
    tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name, args.load_8bit, args.load_4bit, device=args.device)
  File "/data1/code_path/Projects/github/MiniGemini/minigemini/model/builder.py", line 112, in load_pretrained_model
    vision_tower.load_model()
  File "/data1/code_path/Projects/github/MiniGemini/minigemini/model/multimodal_encoder/clip_encoder.py", line 33, in load_model
    self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name)
  File "home_path/anaconda3/envs/minigeimini/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3144, in from_pretrained
    raise EnvironmentError(
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory model_zoo/OpenAI/clip-vit-large-patch14-336.

Some users shared their solution of downloading the corresponding "xxx.index.json" file, but I can't find any "xxx.index.json" file for "clip-vit-large-patch14-336" on the Hugging Face website.

I thought maybe the relative path was causing the problem, so I replaced it with an absolute path, but the same error occurs.

My environment:
OS: ubuntu 22.04 64 bit
python: 3.10.14
others: other libraries are installed according to MiniGemini's official installation guide.

Does anybody have a solution for this? I would be very grateful. Thank you.

AttributeError: 'MiniGeminiLlamaModel' object has no attribute 'vlm_uni_query_projector'

When I was running minigemini.serve.cli to run inference on an image (minigemini_34b_hd), I got this error: AttributeError: 'MiniGeminiLlamaModel' object has no attribute 'vlm_uni_query_projector'.
Should I put all of the pretrained models shown in the attached screenshot in the specified directory, or only some of them?
Looking forward to your reply.
Thanks

How to get 13K generation-related instructions dataset?

Dear author:

Thanks for your interesting work.

But I am still confused about the following questions:

  1. How to get the 13K generation-related instruction dataset?
  2. How to tune only the LLM for generation (or tune the LLM for understanding and generation separately)?

Looking forward to your reply~
Thanks!!

issues running cli inference example

command: python -m minigemini.serve.cli --model-path Mini-Gemini-34B-HD --image-file test_image.png

  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/paperspace/MiniGemini/minigemini/serve/cli.py", line 241, in <module>
    main(args)
  File "/home/paperspace/MiniGemini/minigemini/serve/cli.py", line 200, in main
    output_ids = model.generate(
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/paperspace/MiniGemini/minigemini/model/language_model/mini_gemini_llama.py", line 183, in generate
    return super().generate(
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1575, in generate
    result = self._sample(
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2697, in _sample
    outputs = self(
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
TypeError: MiniGeminiLlamaForCausalLM.forward() got an unexpected keyword argument 'cache_position'

Continue FT from stage 2 with custom data

Hi, I was wondering whether the stage-2 script would be applicable for further finetuning from stage 2 with a small custom dataset for domain transfer, or do we have to write a separate script to do this?

Thanks and appreciate any help given!

Regards,

Adriel

Training model got some error

TypeError: pad_sequence(): argument 'padding_value' (position 3) must be float, not NoneType

I am using Qwen as the LLM and got the above error. What could be the reason? I have checked:

if tokenizer.pad_token_id == None:
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.unk_token
        tokenizer.pad_token_id = tokenizer.encode(tokenizer.pad_token)

This didn't make it work. Any help would be appreciated.

Training on customized LLM loss got sort of high

Hi, I am using Qwen2 4B as the LLM to train the model. The template handling I previously tried on LLaVA works fine, with no warnings.

But when pretraining on Mini-Gemini, I got an unexpected loss result:

{'loss': 1.9597, 'learning_rate': 0.0007436649460805199, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8565, 'learning_rate': 0.0007435934693368483, 'epoch': 0.36}                                                                                                                                                            
{'loss': 2.0595, 'learning_rate': 0.0007435219860653267, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.9304, 'learning_rate': 0.0007434504962678705, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.9705, 'learning_rate': 0.0007433789999463957, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.9536, 'learning_rate': 0.000743307497102818, 'epoch': 0.36}                                                                                                                                                             
{'loss': 1.9405, 'learning_rate': 0.0007432359877390538, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8501, 'learning_rate': 0.0007431644718570192, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8722, 'learning_rate': 0.000743092949458631, 'epoch': 0.36}                                                                                                                                                             
{'loss': 1.6984, 'learning_rate': 0.0007430214205458056, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8309, 'learning_rate': 0.0007429498851204598, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8681, 'learning_rate': 0.0007428783431845109, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.9383, 'learning_rate': 0.0007428067947398757, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8059, 'learning_rate': 0.000742735239788472, 'epoch': 0.36}                                                                                                                                                             
{'loss': 1.8016, 'learning_rate': 0.0007426636783322172, 'epoch': 0.36}                                                                                                                                                            
{'loss': 2.0427, 'learning_rate': 0.0007425921103730288, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.899, 'learning_rate': 0.0007425205359128248, 'epoch': 0.36}  

It looks like the loss is stuck at ~1.8 and no longer decreases. What could be the reason for this? (The data is all the same.)

(I tried using the chat template in both pretrain and finetune, the same as Mini-Gemini does.)

TypeError: MiniGeminiMixtralForCausalLM.forward() got an unexpected keyword argument 'output_router_logits'

Hi, I'm trying to use MiniGemini outside of the demo environment, but am running into the following error when calling model.generate():

  File "/app/backend/minigemini.py", line 104, in chat_with_images
    output_ids = self.model.generate(
                 ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/MiniGemini/minigemini/model/language_model/mini_gemini_mixtral.py", line 142, in generate
    return super().generate(
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/generation/utils.py", line 1527, in generate
    result = self._greedy_search(
             ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/generation/utils.py", line 2411, in _greedy_search
    outputs = self(
              ^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: MiniGeminiMixtralForCausalLM.forward() got an unexpected keyword argument 'output_router_logits'

I have MiniGemini-2B working, but Mixtral is still giving me some trouble. The only relevant reference I can find is: https://www.opensourceagenda.com/projects/transformers/versions, which mentions output_router_logits was removed in transformers 4.39.0.

I see a similar error with Mini-Gemini-7B:

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: MiniGeminiLlamaForCausalLM.forward() got an unexpected keyword argument 'cache_position'

I'm using:
transformers 4.39.3
accelerate-0.29.1
torch 2.2.2
torchvision 0.17.2

I confess, the dependencies for MiniGemini are a real challenge to integrate, but I hope it can get resolved.

Similarities to LLaVA-HR

Congratulations on your great work and solid performance! However, we notice that your core idea and design are highly similar to our previous work LLaVA-HR, especially the dual visual pathways for MLLMs. We would like to see a brief clarification and discussion of our work in your paper. Thank you!

่ฟ่กŒไปฃ็ ๆŠฅ้”™AttributeError: 'list' object has no attribute 'to'๏ผŒ image_aux_features_raw = self.get_model().get_vision_tower_aux()(images_aux).to(dtype=image_features.dtype, device=image_features.device)

Traceback (most recent call last):
File "/checkpoint/binary/train_package/minigemini/train/train_mem.py", line 14, in
train(attn_implementation="flash_attention_2")
File "/checkpoint/binary/train_package/minigemini/train/train.py", line 1262, in train
trainer.train()
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1961, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2902, in training_step
loss = self.compute_loss(model, inputs)
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2925, in compute_loss
outputs = model(**inputs)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1833, in forward
loss = self.module(*inputs, **kwargs)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/checkpoint/binary/train_package/minigemini/model/language_model/mini_gemini_gemma.py", line 87, in forward
) = self.prepare_inputs_labels_for_multimodal(
File "/checkpoint/binary/train_package/minigemini/model/mini_gemini_arch.py", line 328, in prepare_inputs_labels_for_multimodal
image_features = self.encode_images(images, images_aux)
File "/checkpoint/binary/train_package/minigemini/model/mini_gemini_arch.py", line 255, in encode_images
image_aux_features_raw = self.get_model().get_vision_tower_aux()(images_aux).to(
AttributeError: 'list' object has no attribute 'to'

4 bit loading fails

I have tried both the model worker and the CLI, and both fail with the following error message when 4-bit loading is passed:

Loading pretrained weights (convnext_large_d_320).
Traceback (most recent call last):
  File "/usr/lib64/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/xxx/minigemini/serve/cli.py", line 234, in <module>
    main(args)
  File "/xxx/minigemini/serve/cli.py", line 56, in main
    tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name, args.load_8bit, args.load_4bit, device=args.device)
  File "/xxx/minigemini/model/builder.py", line 124, in load_pretrained_model
    model.get_model().initialize_uni_modules(model.config, for_eval=True)
  File "/xxx/minigemini/model/mini_gemini_arch.py", line 213, in initialize_uni_modules
    get_w(projector_weights, 'vision_tower.vision_tower', self.vision_tower, 'vision_tower')
  File "/xxx/minigemini/model/mini_gemini_arch.py", line 209, in get_w
    getattr(main_module, sub_module).to(device=device_type, dtype=weight_type)
  File "/xxx/venv/lib64/python3.10/site-packages/transformers/modeling_utils.py", line 2460, in to
    return super().to(*args, **kwargs)
  File "/xxx/venv/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in to
    raise TypeError('nn.Module.to only accepts floating point or complex '
TypeError: nn.Module.to only accepts floating point or complex dtypes, but got desired dtype=torch.uint8

Loading with 8-bit works, but it OOMs on my hardware (24+24 GB VRAM).

Limitations on current image generating method

Hi, I found that the current image generation approach makes it hard to use the input image as a reference:

[example screenshot]

Any thoughts about it?

BTW, I found the pretraining loss at this stage is quite high:

{'loss': 2.573, 'learning_rate': 0.0007701925673852566, 'epoch': 0.34}                                                                                                                                                           
{'loss': 2.8217, 'learning_rate': 0.0007698799612970509, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.5646, 'learning_rate': 0.0007695672062744539, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.6857, 'learning_rate': 0.0007692543024900611, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.6944, 'learning_rate': 0.0007689412501165496, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.6418, 'learning_rate': 0.0007686280493266786, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.6801, 'learning_rate': 0.0007683147002932893, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.7245, 'learning_rate': 0.0007680012031893049, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.6275, 'learning_rate': 0.0007676875581877296, 'epoch': 0.34}

3 gradio issues

There are quite a few Gradio issues:

  1. function_markdown is undefined
  2. unexpected keyword concurrency_limit
  3. recursive json encoder

First two were "fixed" by just commenting them out, however the third issue prevent gradio working at all as it immediately crashes the instance with that error.

/xxx//minigemini/serve/gradio_web_server.py:351: UserWarning: `layout` parameter is deprecated, and it has no effect
  chatbot = gr.Chatbot(
Traceback (most recent call last):
  File "/usr/lib64/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/xxx/minigemini/serve/gradio_web_server.py", line 472, in <module>
    demo = build_demo(args.embed, concurrency_count=args.concurrency_count)
  File "/xxx/minigemini/serve/gradio_web_server.py", line 371, in build_demo
    gr.Markdown(function_markdown)
/xxx//minigemini/serve/gradio_web_server.py:351: UserWarning: `layout` parameter is deprecated, and it has no effect
  chatbot = gr.Chatbot(
Traceback (most recent call last):
  File "/usr/lib64/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/xxx//minigemini/serve/gradio_web_server.py", line 472, in <module>
    demo = build_demo(args.embed, concurrency_count=args.concurrency_count)
  File "/xxx//minigemini/serve/gradio_web_server.py", line 394, in build_demo
    regenerate_btn.click(
TypeError: EventListenerMethod.__call__() got an unexpected keyword argument 'concurrency_limit'
File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 287, in jsonable_encoder
  encoded_value = jsonable_encoder(
File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 331, in jsonable_encoder
  return jsonable_encoder(
File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 287, in jsonable_encoder
  encoded_value = jsonable_encoder(
File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 331, in jsonable_encoder
  return jsonable_encoder(
File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 287, in jsonable_encoder
  encoded_value = jsonable_encoder(
File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 318, in jsonable_encoder
  if isinstance(obj, classes_tuple):
File "/usr/lib64/python3.10/abc.py", line 119, in __instancecheck__
  return _abc_instancecheck(cls, instance)
RecursionError: maximum recursion depth exceeded in comparison
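These all look like version-mismatch symptoms: concurrency_limit is a keyword from newer gradio (4.x) releases, and the RecursionError inside fastapi's jsonable_encoder usually comes from an incompatible gradio/fastapi/pydantic combination. A quick sanity check before patching anything (sketch; the versions MGM actually pins are in its pyproject.toml, which I have not cross-checked here):

import fastapi
import gradio
import pydantic

print("gradio  :", gradio.__version__)   # concurrency_limit needs a newer (4.x) gradio
print("fastapi :", fastapi.__version__)
print("pydantic:", pydantic.__version__) # the RecursionError usually points at a
                                         # pydantic/fastapi combination gradio dislikes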

Any ideas on coping with long video input?

Dear authors,
Thanks for publishing the Mini-Gemini paper. Since Gemini 1.5 supports up to an hour of video input at 1 fps sampling, I wonder how to adapt your framework to support long-video training and inference? Thank you.
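In case it helps frame the question, the 1 fps sampling itself is easy to sketch (using decord, which is not a dependency of this repo); how the sampled frames should then be routed through the dual low/high-resolution encoders and the patch info mining is exactly what I am asking about:

from decord import VideoReader, cpu

def sample_frames_1fps(video_path, max_frames=3600):
    # Read roughly one frame per second; a one-hour video yields ~3600 frames.
    vr = VideoReader(video_path, ctx=cpu(0))
    step = max(int(round(vr.get_avg_fps())), 1)
    indices = list(range(0, len(vr), step))[:max_frames]
    return vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3), uint8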

No image generated when running python -m minigemini.serve.cli --gen

Dear author:

Thanks for your interesting work.

When I run the following command with the input "What is unusual about this image?", no image is generated in the output:

python -m minigemini.serve.cli \
    --model-path work_dirs/Mini-Gemini/Mini-Gemini-2B \
    --image-file examples/extreme_ironing.jpg \
    --gen

[screenshot]

I wonder whether the Mini-Gemini-2B model simply lacks the ability to generate images?

And if fine-tuning is needed, which datasets should be used to make the model output <h> ... </h>?

Thanks!!
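A quick way to check whether the 2B model ever emits a generation prompt at all (the <h> ... </h> tag format is taken from the question above; this regex is my own check, not the repo's parsing code):

import re

def extract_gen_prompt(output_text):
    # Returns the text inside <h> ... </h> if the model produced one, else None.
    match = re.search(r"<h>(.*?)</h>", output_text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

If this returns None for the 2B model's output, the model never triggers the generation branch, which would match the behaviour described above.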

CLI of 2B does not work

How to reproduce:

  1. Install this repo as described in README.md
  2. Run the following commands:
export HF_ENDPOINT=https://hf-mirror.com

python -m minigemini.serve.cli \
    --model-path ./Mini-Gemini/Mini-Gemini-2B \
    --image-file ./images/demo_gen.png \
    --debug

Results:

The model does not generate anything because stop_str is set to the empty string ''. After I fixed this, I also found that the chat history is not preserved in the prompt, which makes the multi-turn conversation results unexpected.
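For reference, a minimal sketch of the stop_str guard I have in mind (hypothetical; the surrounding conversation-template code in minigemini/serve/cli.py is not reproduced here):

def resolve_stop_str(stop_str, eos_token):
    # An empty stop string makes keyword-based stopping criteria fire immediately,
    # so fall back to the tokenizer's EOS token instead.
    return stop_str if stop_str else eos_token

print(resolve_stop_str("", "</s>"))  # -> "</s>"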

datasets preparation

There are some image sources missing in the preparation stage:
sam/images/: 19982
wikiart/images/: 500
share_textvqa/images/: 500
web-celebrity/images/: 498
web-landmark/images/: 500
Could you help add these datasets?
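For anyone hitting the same gap, a small sanity check against the counts above (sketch; adjust the root to wherever your finetuning data folder lives locally):

from pathlib import Path

EXPECTED = {
    "sam/images": 19982,
    "wikiart/images": 500,
    "share_textvqa/images": 500,
    "web-celebrity/images": 498,
    "web-landmark/images": 500,
}

root = Path("data/MGM-Finetune")  # assumed local layout
for rel, expected in EXPECTED.items():
    found = sum(1 for p in (root / rel).glob("*") if p.is_file())
    print(f"{rel}: {found}/{expected} images")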

loss suddenly drop to 0 and remain 0

Based on the provided stage-1 pretrained model Mini-Gemini-7B-Pretrain and the default config scripts/llama/train/stage_2_full_v7b_672_hr_1536.sh, I trained stage 2 using only minigemini_generation_pure_text.json. The loss suddenly dropped to 0 during training (after about 10 steps) and stayed at 0 for the rest of the run.

[screenshot]

@yanwei-li @yukang2017 @wcy1122 After printing the intermediate variables, I saw that shift_logits becomes NaN, which causes the loss to become NaN. Is this normal?

[screenshot]
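For debugging this, a minimal watchdog that catches the step where the logits first go non-finite (sketch only; where exactly to call it inside the Trainer / loss computation is an assumption):

import torch

def check_finite(name, tensor):
    # Raise as soon as NaN/Inf appears so the run stops before the loss
    # silently collapses to 0 for the rest of training.
    if not torch.isfinite(tensor).all():
        bad = (~torch.isfinite(tensor)).sum().item()
        raise RuntimeError(f"{name} contains {bad} non-finite values")

# e.g. check_finite("shift_logits", shift_logits) right before the cross-entropy loss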

Mini-Gemini-2B evaluation error

Hi, I followed the instructions to prepare the data and model, and I get the following error when evaluating Mini-Gemini-2B.

AttributeError: 'OpenCLIPVisionTower' object has no attribute 'vision_stem'

Questions about changing the ViT to 378 input resolution: poor results

Hi, I have already tried ViT-336 and ConvNeXt + a Qwen LLM, which works great and really gives good performance.

But when I try another CLIP ViT model with an input size of 378, keeping everything else the same (including the training data), the results are extremely poor.

To be precise:

  1. The loss is lower: I normally get 0.9-1.0, but with the 378-input CLIP the loss goes down to 0.7-0.8, yet the inference results are very poor.
  2. The CLIP model I used was Apple's DNFS_vit_G_378 model.
  3. I have changed the ConvNeXt input resolution accordingly.

Any reason for this? It is really weird that a better and larger ViT gives worse results.
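One thing worth checking (sketch; the stride-16 high-resolution feature map and the 2x2 candidates-per-query ratio are my reading of the default setup, not verified against the code): a 378-input ViT with patch size 14 yields a 27x27 query grid instead of 24x24, which no longer tiles the high-resolution feature grid evenly, so any patch-info-mining code that assumes the 336 grid could silently mis-align.

def check_alignment(vit_input, vit_patch, hr_input, hr_stride):
    lr = vit_input // vit_patch   # low-res visual queries per side
    hr = hr_input // hr_stride    # high-res candidate features per side
    ok = hr % lr == 0
    print(f"LR {lr}x{lr} vs HR {hr}x{hr}: {'aligned' if ok else 'MISALIGNED'}")

check_alignment(336, 14, 768, 16)  # 24 vs 48 -> aligned (2x2 candidates per query)
check_alignment(378, 14, 768, 16)  # 27 vs 48 -> misaligned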
