dvlab-research / mgm Goto Github PK

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

License: Apache License 2.0

Python 85.01% HTML 1.38% JavaScript 1.80% CSS 0.33% Shell 11.48%

generation large-language-models vision-language-model

mgm's Introduction

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

The framework supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B with image understanding, reasoning, and generation simultaneously. We build this repo based on LLaVA.

Release

[05/03] 🔥 We support LLaMA3-based models! Welcome to try them here.
[04/15] 🔥 The Hugging Face demo is available. It's a 13B-HD version, welcome to watch and try.
[03/28] 🔥 Mini-Gemini is coming! We release the paper, demo, code, models, and data!

Demo
Install
Model
Preparation
Train
Evaluation
Examples
Citation
Acknowledgement
License

Demo

We provide some selected examples in this section. More examples can be found in our project page. Feel free to try our online demo!

Install

Please follow the instructions below to install the required packages.

NOTE: If you want to use the 2B version, please ensure to install the latest version Transformers (>=4.38.0).

Clone this repository

git clone https://github.com/dvlab-research/MGM.git

Install Package

conda create -n mgm python=3.10 -y
conda activate mgm
cd MGM
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Install additional packages for training cases

pip install ninja
pip install flash-attn --no-build-isolation

Model

The framework is conceptually simple: dual vision encoders are utilized to provide low-resolution visual embedding and high-resolution candidates; patch info mining is proposed to conduct patch-level mining between high-resolution regions and low-resolution visual queries; LLM is utilized to marry text with images for both comprehension and generation at the same time.

We provide all our fully finetuned models on Stage 1 and 2 data:

Model	LR	HR	Base LLM	Vision Encoder	Finetuning Data	Finetuning schedule	Download
MGM-2B	336	768	Gemma-2B	CLIP-L	MGM-Instruct	full_ft-1e	ckpt
MGM-7B	336	768	Vicuna-7B-v1.5	CLIP-L	MGM-Instruct	full_ft-1e	ckpt
MGM-13B	336	768	Vicuna-13B-v1.5	CLIP-L	MGM-Instruct	full_ft-1e	ckpt
MGM-8B	336	768	LLaMA-3-8B-Instruct	CLIP-L	MGM-Instruct	full_ft-1e	ckpt
MGM-8x7B	336	768	Mixtral-8x7B-Instruct-v0.1	CLIP-L	MGM-Instruct	full_ft-1e	ckpt
MGM-34B	336	768	Nous-Hermes-2-Yi-34B	CLIP-L	MGM-Instruct	full_ft-1e	ckpt
MGM-7B-HD	672	1536	Vicuna-7B-v1.5	CLIP-L	MGM-Instruct	full_ft-1e	ckpt
MGM-13B-HD	672	1536	Vicuna-13B-v1.5	CLIP-L	MGM-Instruct	full_ft-1e	ckpt
MGM-8B-HD	672	1536	LLaMA-3-8B-Instruct	CLIP-L	MGM-Instruct	full_ft-1e	ckpt
MGM-8x7B-HD	672	1536	Mixtral-8x7B-Instruct-v0.1	CLIP-L	MGM-Instruct	full_ft-1e	ckpt
MGM-34B-HD	672	1536	Nous-Hermes-2-Yi-34B	CLIP-L	MGM-Instruct	full_ft-1e	ckpt

Here are the pretrained weights on Stage 1 data only:

Model	LR	HR	Base LLM	Vision Encoder	Pretrain Data	Finetuning schedule	Download
MGM-2B	336	768	Gemma-2B	CLIP-L	MGM-Pretrain	1e	ckpt
MGM-7B	336	768	Vicuna-7B-v1.5	CLIP-L	MGM-Pretrain	1e	ckpt
MGM-13B	336	768	Vicuna-13B-v1.5	CLIP-L	MGM-Pretrain	1e	ckpt
MGM-8x7B	336	768	Mixtral-8x7B-Instruct-v0.1	CLIP-L	MGM-Pretrain	1e	ckpt
MGM-34B	336	768	Nous-Hermes-2-Yi-34B	CLIP-L	MGM-Pretrain	1e	ckpt

Preparation

Dataset

We provide the processed data for the model training. For model pretraining, please download the following the training image-based data and organize them as:

-> means put the data in the local folder.

LLaVA Images -> data/MGM-Pretrain/images, data/MGM-Finetune/llava/LLaVA-Pretrain/images
ALLaVA Caption -> data/MGM-Pretrain/ALLaVA-4V

For model finetuning, please download the following the instruction data and organize them as:

-> means put the data in the local folder.

COCO train2017 -> data/MGM-Finetune/coco
GQA -> data/MGM-Finetune/gqa
OCR-VQA (we save all files as .jpg) -> data/MGM-Finetune/ocr_vqa
TextVQA (not included for training) -> data/MGM-Finetune/textvqa
VisualGenome part1, VisualGenome part2 -> data/MGM-Finetune/vg
ShareGPT4V-100K -> data/MGM-Finetune/sam, share_textvqa, wikiart, web-celebrity, web-landmark
LAION GPT4V -> data/MGM-Finetune/gpt4v-dataset
ALLaVA Instruction -> data/MGM-Pretrain/ALLaVA-4V
DocVQA -> data/MGM-Finetune/docvqa
ChartQA -> data/MGM-Finetune/chartqa
DVQA -> data/MGM-Finetune/dvqa
AI2D -> data/MGM-Finetune/ai2d

For model evaluation, please follow this link for preparation. We use some extra benchmarks for evaluation. please download the following the training image-based data and organize them as:

-> means put the data in the local folder.

MMMU -> data/MGM-Eval/MMMU
MMB -> data/MGM-Eval/MMB
MathVista -> data/MGM-Eval/MathVista

Please put the pretrained data, finetuned data, and eval data in MGM-Pretrain, MGM-Finetune, and MGM-Eval subset following Structure.

For meta info, please download the following files and organize them as in Structure.

Data file name	Size
mgm_pretrain.json	1.68 G
mgm_instruction.json	1.79 G
mgm_generation_pure_text.json	0.04 G

IMPORTANT: mgm_generation_pure_text.json is a generation-related subset. DO NOT merge it with mgm_instruction.json as it is already included in it. You may merge this file with your customized LLM/VLM SFT dataset to enable the reasoning generation ability.

Pretrained Weights

We recommend users to download the pretrained weights from the following link CLIP-Vit-L-336, OpenCLIP-ConvNeXt-L, Gemma-2b-it, Vicuna-7b-v1.5, Vicuna-13b-v1.5, Mixtral-8x7B-Instruct-v0.1, and Nous-Hermes-2-Yi-34B , and put them in model_zoo following Structure.

Structure

The folder structure should be organized as follows before training.

MGM
├── mgm
├── scripts
├── work_dirs
│   ├── MGM
│   │   ├── MGM-2B
│   │   ├── ...
├── model_zoo
│   ├── LLM
│   │   ├── gemma
│   │   │   ├── gemma-2b-it
│   │   ├── vicuna
│   │   │   ├── 7B-V1.5
│   │   │   ├── 13B-V1.5
│   │   ├── llama-3
│   │   │   ├── Meta-Llama-3-8B-Instruct
│   │   │   ├── Meta-Llama-3-70B-Instruct
│   │   ├── mixtral
│   │   │   ├── Mixtral-8x7B-Instruct-v0.1
│   │   ├── Nous-Hermes-2-Yi-34B
│   ├── OpenAI
│   │   ├── clip-vit-large-patch14-336
│   │   ├── openclip-convnext-large-d-320-laion2B-s29B-b131K-ft-soup
├── data
│   ├── MGM-Pretrain
│   │   ├── mgm_pretrain.json
│   │   ├── images
│   │   ├── ALLaVA-4V
│   ├── MGM-Finetune
│   │   ├── mgm_instruction.json
│   │   ├── llava
│   │   ├── coco
│   │   ├── gqa
│   │   ├── ocr_vqa
│   │   ├── textvqa
│   │   ├── vg
│   │   ├── gpt4v-dataset
│   │   ├── sam
│   │   ├── share_textvqa
│   │   ├── wikiart
│   │   ├── web-celebrity
│   │   ├── web-landmark
│   │   ├── ALLaVA-4V
│   │   ├── docvqa
│   │   ├── chartqa
│   │   ├── dvqa
│   │   ├── ai2d
│   ├── MGM-Eval
│   │   ├── MMMU
│   │   ├── MMB
│   │   ├── MathVista
│   │   ├── ...

Train

The training process consists of two stages: (1) feature alignment stage: bridge the vision and language tokens; (2) instruction tuning stage: teach the model to follow multimodal instructions.

Our models are trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.

Please make sure you download and organize the data following Preparation before training.

NOTE: Please set hostfile for 2 machine training and hostfile_4 for 4 machine training.

If you want to train and finetune the framework, please run the following command for MGM-7B with image size 336:

bash scripts/llama/train/stage_1_2_full_v7b_336_hr_768.sh

or for MGM-13B with image size 336:

bash scripts/llama/train/stage_1_2_full_v13b_336_hr_768.sh

Because we reuse the pre-trained projecter weights from the MGM-7B, you can directly use the MGM-7B-HD with image size 672 for stage-2 instruction tuning:

bash scripts/llama/train/stage_2_full_v7b_672_hr_1536.sh

Please find more training scripts of gemma, llama, mixtral, and yi in scripts/.

Evaluation

We perform evaluation on several image-based benchmarks. Please download the evaluation data following Preparation and organize them as in Structure.

Model	LLM	Res.	Link	TextVQA	MMB	MME	MM-Vet	MMMU_val	MMMU_test	MathVista
MGM-2B	Gemma-2B	336	ckpt	56.2	59.8	1341/312	31.1	31.7	29.1	29.4
MGM-7B	Vicuna-7B-v1.5	336	ckpt	65.2	69.3	1523/316	40.8	36.1	32.8	31.4
MGM-13B	Vicuna-13B-v1.5	336	ckpt	65.9	68.5	1565/322	46.0	38.1	33.5	37.0
MGM-8B	LLaMA-3-8B-Instruct	336	ckpt	67.6	72.7	1606/341	47.3	38.2	36.3	--
MGM-8x7B	Mixtral-8x7B-Instruct-v0.1	336	ckpt	69.2	75.6	1639/379	45.8	41.8	37.1	41.8
MGM-34B	Nous-Hermes-2-Yi-34B	336	ckpt	70.1	79.6	1666/439	53.0	48.7	43.6	38.9
MGM-7B-HD	Vicuna-7B-v1.5	672	ckpt	68.4	65.8	1546/319	41.3	36.8	32.9	32.2
MGM-13B-HD	Vicuna-13B-v1.5	672	ckpt	70.2	68.6	1597/320	50.5	37.3	35.1	37.0
MGM-8B-HD	LLaMA-3-8B-Instruct	672	ckpt	71.6	--	1532/357	--	37.0	--	--
MGM-8x7B-HD	Mixtral-8x7B-Instruct-v0.1	672	ckpt	71.9	74.7	1633/356	53.5	40.0	37.0	43.1
MGM-34B-HD	Nous-Hermes-2-Yi-34B	672	ckpt	74.1	80.6	1659/482	59.3	48.0	44.9	43.3

If you want to evaluate the model on image-based benchmarks, please use the scripts in scripts/MODEL_PATH/eval. For example, run the following command for TextVQA evaluation with MGM-7B-HD:

bash scripts/llama/eval/textvqa.sh

Please find more evaluation scripts in scripts/MODEL_PATH.

CLI Inference

Chat with images without the need of Gradio interface. It also supports multiple GPUs, 4-bit and 8-bit quantized inference. With 4-bit quantization. Please make sure you have installed diffusers and PaddleOCR (only for better experience with OCR), and try this for image and generation inference:

python -m mgm.serve.cli \
    --model-path work_dirs/MGM/MGM-13B-HD \
    --image-file <path to your image>

or try this better experience with OCR (make sure you have installed PaddleOCR):

python -m mgm.serve.cli \
    --model-path work_dirs/MGM/MGM-13B-HD \
    --image-file <path to your image> \
    --ocr

or try this for inference with generation (make sure you have installed diffusers):

python -m mgm.serve.cli \
    --model-path work_dirs/MGM/MGM-13B-HD \
    --image-file <path to your image> \
    --gen

You can also try 8bit or even 4bit for efficient inference

python -m mgm.serve.cli \
    --model-path work_dirs/MGM/MGM-13B-HD \
    --image-file <path to your image> \
    --gen
    --load-8bit

Gradio Web UI

Here, we adopt the Gradio UI similar to that in LLaVA to provide a user-friendly interface for our models. To launch a Gradio demo locally, please run the following commands one by one. If you plan to launch multiple model workers to compare between different checkpoints, you only need to launch the controller and the web server ONCE.

Launch a controller

python -m mgm.serve.controller --host 0.0.0.0 --port 10000

Launch a gradio web server.

python -m mgm.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload

You just launched the Gradio web interface. Now, you can open the web interface with the URL printed on the screen. You may notice that there is no model in the model list. Do not worry, as we have not launched any model worker yet. It will be automatically updated when you launch a model worker.

Launch a model worker

This is the actual worker that performs the inference on the GPU. Each worker is responsible for a single model specified in --model-path.

python -m mgm.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path work_dirs/MGM/MGM-13B-HD

Wait until the process finishes loading the model and you see "Uvicorn running on ...". Now, refresh your Gradio web UI, and you will see the model you just launched in the model list.

You can launch as many workers as you want, and compare between different models in the same Gradio interface. Please keep the --controller the same, and modify the --port and --worker to a different port number for each worker.

python -m mgm.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port <different from 40000, say 40001> --worker http://localhost:<change accordingly, i.e. 40001> --model-path work_dirs/MGM/MGM-34B-HD

If you are using an Apple device with an M1 or M2 chip, you can specify the mps device by using the --device flag: --device mps.

Launch a model worker (Multiple GPUs, when GPU VRAM <= 24GB)

If the VRAM of your GPU is less than 24GB (e.g., RTX 3090, RTX 4090, etc.), you may try running it with multiple GPUs. Our latest code base will automatically try to use multiple GPUs if you have more than one GPU. You can specify which GPUs to use with CUDA_VISIBLE_DEVICES. Below is an example of running with the first two GPUs.

CUDA_VISIBLE_DEVICES=0,1 python -m mgm.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path work_dirs/MGM/MGM-13B-HD

Launch a model worker (4-bit, 8-bit inference, quantized)

You can launch the model worker with quantized bits (4-bit, 8-bit), which allows you to run the inference with reduced GPU memory footprint. Note that inference with quantized bits may not be as accurate as the full-precision model. Simply append --load-4bit or --load-8bit to the model worker command that you are executing. Below is an example of running with 4-bit quantization.

python -m mgm.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path work_dirs/MGM/MGM-13B-HD --load-4bit

Examples

We provide some examples in this section. More examples can be found in our project page.

Hi-Resolution Understanding

Generation with Reasoning

Citation

If you find this repo useful for your research, please consider citing the paper

@article{li2024mgm,
  title={Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models},
  author={Li, Yanwei and Zhang, Yuechen and Wang, Chengyao and Zhong, Zhisheng and Chen, Yixin and Chu, Ruihang and Liu, Shaoteng and Jia, Jiaya},
  journal={arXiv:2403.18814},
  year={2023}
}

Acknowledgement

This project is not affiliated with Google LLC.

We would like to thank the following repos for their great work:

This work is built upon the LLaVA.
This work utilizes LLMs from Gemma, Vicuna, Mixtral, and Nous-Hermes.

License

The data and checkpoint is intended and licensed for research use only. They are also restricted to uses that follow the license agreement of LLaVA, LLaMA, Vicuna and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.

mgm's People

Contributors

Stargazers

Watchers

Forkers

wcy1122 julianjuaner shaotengliu crackercat bruinxiong evdcush otherbackup dtbinh hertera1 frostbyte012 eltociear donbr stjordanis itsharex jac0320 techthiyanes evelynmitchell kotthoff xiaolong-rrl hfengzhi ericxsun dearborn-open-ai duyuankai1992 qwerty6518 friendmine jakubik2023 josephrp chp20230826 slidersun mandarukrulkar thanhuit23 yanxg j-l-o hudawei996 bachvudinh f901107 vinodtahelyani sereact jwgu paperwave lathis4 xiechengmude bo-work nemonameless alphaqi apokar liunix61 haikuoxin nhsjgczryf vtea long630904 fangwudi flclain jiangzongkang dogpandacat alexlly xianzhuoliu ainisa20 bzwqq onenotell stephan-who kknet lovecove hezuogongying 360mini chaojigang001 slimsymphony yjfkpyu jqk6 lingmao01 linguo123 albertbj tonghengcheng jiaqianjing chrysfay neomx7 aixia121 cuiyuheng dthcle captain29999 hubin858130 maxmax2016 jimmyleesnow githubmalajava qiangqiang199 thanhpham1987 daishoui weisili2016 zcfrank1st shaun95 qinwentu z1qsx leoncuhk hunaid2000 hesam7711 ablaze-projects ruiwenhong xuexiaofei comphy zhangliaoxiaoqin

mgm's Issues

Not find vision tower: model_zoo/OpenAI/clip-vit-large-patch14-336

Hi, Nice work! but when i run the example code , i have encounter some problems,when i run python -m minigemini.serve.cli --model-path YanweiLi/Mini-Gemini-13B-HD --image-file Woman.jpg ,i got the error ValueError: Not find vision tower: model_zoo/OpenAI/clip-vit-large-patch14-336 ,how can i solve this problem thank u!

Possible positional embedding in Patch Info Mining process

Thanks for sharing your great work. I have a small question about the attention process in the patch info mining module. I found that there is no positional embedding for the high-resolution tokens to indicate their position in the corresponding patch. Do you think adding positional embedding here would be helpful and have you tried this?

怎么从你们的检查点启动呢

我发现教程只有从开源大模型的权重微调的，如果要在你们的检查点启动使用什么方法呢

关于token数量的疑问

针对llama2-7b模型，max token length为2048，按照stage2 的训练参数，一旦设置IMAGE_GRID=2和IMAGE_GLOBAL为True，
image_features = torch.cat([image_feat_global, image_features], dim=1)
image_aux_features = torch.cat([image_aux_feat_global, image_aux_features], dim=1)
两行最终得到的图像特征token数就变成了2880，这里token数不是超了吗？如果理解有误敬请指正。

About vl model path

model_zoo/OpenAI/clip-vit-large-patch14-336 and model_zoo/OpenAI/openclip-convnext-large-d-320-laion2B-s29B-b131K-ft-soup from https://huggingface.co/YanweiLi/Mini-Gemini-7B/blob/main/config.json#L31

What is the exact model path of the huggingface?

Can provide laion-gpt4v dataset images zip?

Hi, after downloaded the laion-gpt4v images, I got only: 11686 images, am using json index order as image name, to avoid the bias between dataset, and possibly wrong annotation to image, just to be sure, does the last image index is:

Why does any stage didn't optimize vision encoder?

Nice work, from the scripts provided, seems all optimize_vision_tower either aux were not used.

Then I have some puzzels:

Is that optimize get worse results?
If not finetune the vision tower, how to make it learn new visual tokens?

关于代码实现的疑问

作者好，非常感谢你的开源。我有两个疑问想咨询下

1. convnext 的 drop path 问题

convnext 的 drop path 是 0.1，虽然你在训练时候设置的是不训练，但是这个路径还是会执行的。原则上在训练时候，convnext 要设置为 eval 模式吧？我没有找到相关代码，有点奇怪，想了解下

2. 代码鲁棒性问题

如果是 clip+convnext 组合方式训练没有任何问题。但是我想尝试 siglip，发现在仅仅将 clip 换成 siglip 后(均值和方差也是两套)，模型在 iter=2 时候就会出现 nan。注释如下

        # token attention
        embed_query = self.vlm_uni_query_projector(images)
        embed_aux = self.vlm_uni_aux_projector(images_aux)
        embed_value = self.vlm_uni_val_projector(images_aux)
        # TODO siglip+convnext 在第一次 forward 后正常，但是 embed_att 会出现 nan
        # TODO 导致第二次迭代时候 embed_value 会出现 nan，无法训练
        # TODO 怀疑是特征不匹配，即使全部转换为 fp32 也会出现 nan, 需要进一步排查
        embed_att = embed_query[:, :, None] @ (embed_aux.transpose(-1, -2) / (embed_aux.shape[-1] ** 0.5))
        # print('=xxxx=', torch.any(torch.isnan(embed_query)).item(),
        #       torch.any(torch.isnan(embed_aux)).item(),
        #       torch.any(torch.isnan(embed_value)).item(),
        #       torch.any(torch.isnan(embed_att)).item())
        embed_att = embed_att.nan_to_num()
        embed_feat = (embed_att.softmax(-1) @ embed_value).mean(2)
        # print('=xxcccxx=', torch.any(torch.isnan(embed_feat)).item())
        image_features = images + embed_feat
        return image_features

不知道作者对于这个问题是咋看的。有一个细节不一样：因为 siglip 输入是 384x384 的，输出尺度是 27x27，所以我必须要将 convnext 输入分辨率设置为 864, 这样才可以保证两者空间完全对齐。

期待您的回复。

minigemini_instruction.json包含了预训练的LLaVA Images，但是在readme中没有写到微调用到了这部分数据

minigemini_instruction.json中有条目的image为llava/LLaVA-Pretrain/images/00013/000133305.jpg，有很多llava的图片，但是readme中没有提到要用llava的图片，是你们疏忽了吗

images_aux encode error

When using clip.py for inference, I encountered the following error. How should I solve it?

File "/mnt/bn/codegen-finetune/projects/MiniGemini/minigemini/model/mini_gemini_arch.py", line 255, in encode_images
if images_aux is not None:
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/mnt/bn/codegen-finetune/projects/MiniGemini/minigemini/model/multimodal_encoder/clip_encoder.py", line 58, in forward
image_features = self.image_forward(images)
File "/mnt/bn/codegen-finetune/projects/MiniGemini/minigemini/model/multimodal_encoder/clip_encoder.py", line 50, in image_forward
image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 917, in forward
return self.vision_model(
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 841, in forward
hidden_states = self.embeddings(pixel_values)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 187, in forward
embeddings = embeddings + self.position_embedding(self.position_ids)
RuntimeError: The size of tensor a (2305) must match the size of tensor b (50) at non-singleton dimension 1

batch inference

Hi could you tell me how I can do batch inferencing?
I have multiple images and a different prompt for eacg image, so is there a way i can get the output in one go ?

Questions about how to enlarge the base vision tower input resolution

Currently, there is an option to using two image grid to double the input, but this introduce sort of heavy compute.

I want just make the input resolution slightly large, say from 336 -> 448, while keep the Convnext input resolution same (although I think currenctly it should larger if base vision tower are larger).

Could that be possible? Can u give me some advisor how to adopt it?

Could you please share the pretrain weights after stage 1?

AutoTokenizer resolve error

Simply run the demo code as following fails:

python -m minigemini.serve.cli \
    --model-path YanweiLi/Mini-Gemini-2B \
    --image-file ./images/demo_gen.png

error message:

  File "XXXXX/envs/minigemini/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 784, in from_pretrained
    raise ValueError(
ValueError: Tokenizer class GemmaTokenizer does not exist or is not currently imported.

Update to newest transformers fix this problem

RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [1, 1, 3, 336, 336]

When I generate it using the example image, it tells me that the dimensions are not correct

How to train 34B model?

batching giving weird outputs

Hi I noticed that when doing batch inference with a static prompt like ‘Describe the image’ the model gives wrong output like ‘in detail’, like it is just doing sentence completion. Whereas if I try a more descriptive prompt, where I tell minigemini that it is a ‘prompt generator’ then it goes into sentence completion mode and gives me an okayish response.

However, I have the original image description also, so I tried adding those in the prompt and then asking the model to describe the image, given the information about the image. This works perfectly fine when I just use 1 image at a time.

But when doing batch processing I get completely garbage outputs. To do batch processing, I use padding to get the prompts to the same shape. But this gives me completely garbage output.
I do this by changing line 44 in / MiniGemini/minigemini/mm_utils.py
to
tokenizer(chunk, padding='max_length', max_length=max_len).input_ids for chunk in prompt.split('')

Could you give me any advice on how to do this effectively?

performing finetune on top of mini-gemini-8x7b-HD

when performing finetune on top of the minigemini-8x7b-HD model using the following config:

#!/bin/bash
PRETRAIN_NAME=Mini-Gemini-8x7B-Pretrain
FINETUNE_NAME=Mini-Gemini-8x7B-HD
AUX_SIZE=1536
IMAGE_GRID=2
IMAGE_GLOBAL=True
LR_MULTI="model.mm_projector:2,model.vlm_uni:2"

# delete --hostfile hostfile_4 and change --per_device_train_batch_size if trained on single machine

deepspeed minigemini/train/train_mem.py \
    --deepspeed ./scripts/zero3_offload.json \
    --model_name_or_path Mini-Gemini-8x7B-HD \
    --version mistral_instruct \
    --data_path data/minigemini_instruction.json \
    --image_folder data/figures \
    --vision_tower model_zoo/OpenAI/clip-vit-large-patch14-336 \
    --vision_tower_aux model_zoo/OpenAI/openclip-convnext-large-d-320-laion2B-s29B-b131K-ft-soup \
    --image_grid $IMAGE_GRID \
    --image_global $IMAGE_GLOBAL \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --image_size_aux $AUX_SIZE \
    --bf16 True \
    --output_dir ./work_dirs/experiment-2/$FINETUNE_NAME \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --save_strategy "steps" \
    --save_steps 2 \
    --save_total_limit 1 \
    --learning_rate 8e-4 \
    --lr_multi $LR_MULTI \
    --weight_decay 0. \
    --warmup_ratio 0.05 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 8192 \
    --gradient_checkpointing True \
    --dataloader_num_workers 64 \
    --lazy_preprocess True \
    --report_to wandb

The final model only has 4 safetensors files instead of 20, why is that ? thanks.

evaluate MMMU_val of 2B error

Hi, I try evaluating MMMU_val of 2B by running this command

 bash /home/user/minigemini/scripts/gemma/eval/mmmu.sh

but get error log as follows while evaluation of MMMU_val on 7B, 13B and 34B models could run successfully.

Traceback (most recent call last):
  File "/home/user/MiniGemini/minigemini/eval/MMMU/eval/run_llava.py", line 207, in <module>
    main()                               
  File "/home/user/MiniGemini/minigemini/eval/MMMU/eval/run_llava.py", line 158, in main
    response = call_model_engine(args, sample, model, tokenizer, processor)
  File "/home/user/MiniGemini/minigemini/eval/MMMU/eval/utils/model_utils_ind.py", line 53, in call_llava_engine_df
    output_ids = model.generate(
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/MiniGemini/minigemini/model/language_model/mini_gemini_gemma.py", line 144, in generate
    return super().generate(
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/transformers/generation/utils.py", line 1648, in generate
    result = self._beam_sample(
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/transformers/generation/utils.py", line 3402, in _beam_sample
    outputs = self(                      
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/MiniGemini/minigemini/model/language_model/mini_gemini_gemma.py", line 97, in forward
    return super().forward(
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py", line 1105, in forward
    outputs = self.model(
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py", line 891, in forward
    causal_mask = self._update_causal_mask(attention_mask, inputs_embeds, cache_position)
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py", line 984, in _update_causal_mask
    causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
RuntimeError: The size of tensor a (701) must match the size of tensor b (0) at non-singleton dimension 0
scripts/gemma/eval/mmmu.sh: line 31: MMMU/answers/Mini-Gemini-2B/merge.jsonl: No such file or directory
scripts/gemma/eval/mmmu.sh: line 35: MMMU/answers/Mini-Gemini-2B/merge.jsonl: No such file or directory
Traceback (most recent call last):
  File "/home/user/MiniGemini/minigemini/eval/MMMU/eval/eval.py", line 31, in <module>
    main()                               
  File "/home/user/MiniGemini/minigemini/eval/MMMU/eval/eval.py", line 17, in main
    out_samples = [json.loads(line) for line in open(args.result_file)]
FileNotFoundError: [Errno 2] No such file or directory: 'MMMU/answers/Mini-Gemini-2B/merge.jsonl'

Demo page is not working

Demo page is not working also domain is an ip adress.

Begin training loss a little high

Hi, the begin loss is a little high, does it normal?

{'loss': 6.7177, 'learning_rate': 0.0, 'epoch': 0.0}                                                                                                                                                                             
{'loss': 6.7738, 'learning_rate': 0.0, 'epoch': 0.0}                                                                                                                                                                             
{'loss': 6.801, 'learning_rate': 0.0, 'epoch': 0.0}                                                                                                                                                                              
{'loss': 6.5226, 'learning_rate': 0.0, 'epoch': 0.0}                                                                                                                                                                             
{'loss': 6.813, 'learning_rate': 7.633587786259541e-06, 'epoch': 0.0}                                                                                                                                                            
{'loss': 6.6355, 'learning_rate': 1.5267175572519083e-05, 'epoch': 0.0}                                                                                                                                                          
{'loss': 6.3212, 'learning_rate': 2.2900763358778628e-05, 'epoch': 0.0}                                                                                                                                                          
{'loss': 6.1449, 'learning_rate': 3.0534351145038166e-05, 'epoch': 0.0}                                                                                                                                                          
{'loss': 5.8368, 'learning_rate': 3.816793893129771e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 5.9083, 'learning_rate': 4.5801526717557256e-05, 'epoch': 0.0}                                                                                                                                                          
{'loss': 5.7246, 'learning_rate': 5.3435114503816794e-05, 'epoch': 0.0}                                                                                                                                                          
{'loss': 5.6311, 'learning_rate': 6.106870229007633e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 5.7906, 'learning_rate': 6.870229007633588e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 5.6311, 'learning_rate': 7.633587786259542e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 5.1357, 'learning_rate': 8.396946564885496e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 4.9726, 'learning_rate': 9.160305343511451e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 4.7747, 'learning_rate': 9.923664122137405e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 4.8886, 'learning_rate': 0.00010687022900763359, 'epoch': 0.0}                                                                                                                                                          
{'loss': 4.7224, 'learning_rate': 0.00011450381679389313, 'epoch': 0.0}                                                                                                                                                          
{'loss': 4.4193, 'learning_rate': 0.00012213740458015266, 'epoch': 0.0}                                                                                                                                                          
{'loss': 4.5041, 'learning_rate': 0.00012977099236641222, 'epoch': 0.0}                                                                                                                                                          
{'loss': 4.3678, 'learning_rate': 0.00013740458015267177, 'epoch': 0.01}                                                                                                                                                         
{'loss': 4.3701, 'learning_rate': 0.0001450381679389313, 'epoch': 0.01}                                                                                                                                                          
{'loss': 4.505, 'learning_rate': 0.00015267175572519084, 'epoch': 0.01}                                                                                                                                                          
{'loss': 4.5022, 'learning_rate': 0.00016030534351145037, 'epoch': 0.01}                                                                                                                                                         
{'loss': 4.2254, 'learning_rate': 0.00016793893129770992, 'epoch': 0.01}                                                                                                                                                         
{'loss': 4.339, 'learning_rate': 0.00017557251908396944, 'epoch': 0.01}                                                                                                                                                          
{'loss': 4.5537, 'learning_rate': 0.00018320610687022902, 'epoch': 0.01}                                                                                                                                                         
{'loss': 4.2307, 'learning_rate': 0.00019083969465648857, 'epoch': 0.01}

How's the loss curve on your side? (am using a 4B llm model

Minor issue of the HR feature map size

In section 3.1 of the paper, Should "N`=H/4*W/4" be adjusted to "N`*N`=H/4*W/4"?

size mismatch about convnext_large_d_320

Hello, I downloaded the pre-trained model based on the readme and placed it in the corresponding location. Which step did I have a problem with? Please help.

Where is the data of Generation-related Instructions in Section 3.3?

Thank you for your great work!
Could you clarify where the data about Generation-related Instructions in Section 3.3 is located?

Pretrain data not found in AllaVA

Hi, the pretrained data used allava images both from laion nad vfan.

But the laion part image names are totally different from ALLava's images format.

I tried to found:

465440.jpeg
320609

they all used in minigemini_pretrain.json but can not be found in ALLava images folder.

ls -f images | grep 465440
46544031.jpeg
(base) ➜  allava_laion git:(main) ✗ ls -f images | grep 320609     
132060956.jpeg
43206091.jpeg

why is that？

Fixing the pretrain and sft stage ALLaVa images issue.

Hi, as previous issue raised, the data all wrong for any data from ALLaVA, they have changed the image names.

What even worse, even tough using url mapping the image names with latest ALLaVa's caption json, still some images were totally can not be found, in latest ALLaVa.

for example:

cat ALLaVA-Caption-LAION-4V.json | grep 'https://slideplayer.it/slide/553401/1/images/40/Delayed+relaxation+filling+pattern.jpg' -C 9

this url existed in minigemini but gone at latest ALLava images

Therefore, please help fix the data first, it emergency. Stops anyone want to reveal the minigemini training and scores.

How many SAM images were used from ShareGPT4v?

I downloaded sharegpt4v used fientuneing part data, but always got image not found, am using finetune stage.

Does finetune used sharegpt4v pretrain data?

sharegpt4v finetune just used very little data from SAM.

Shall we download the. sam_000000 - 0000050 whole 500GB images for it?

OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory model_zoo/OpenAI/clip-vit-large-patch14-336.

Hello, everyone, I am using MiniGemini evaluation on an image by typing command:

python -m minigemini.serve.cli  --model-path ./Mini-Gemini-2B/     --image-file replaced_with_path_to_image

then following OSError emerged:

Traceback (most recent call last):
  File "home_path/anaconda3/envs/minigeimini/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "home_path/anaconda3/envs/minigeimini/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data1/code_path/Projects/github/MiniGemini/minigemini/serve/cli.py", line 237, in <module>
    main(args)
  File "/data1/code_path/Projects/github/MiniGemini/minigemini/serve/cli.py", line 56, in main
    tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name, args.load_8bit, args.load_4bit, device=args.device)
  File "/data1/code_path/Projects/github/MiniGemini/minigemini/model/builder.py", line 112, in load_pretrained_model
    vision_tower.load_model()
  File "/data1/code_path/Projects/github/MiniGemini/minigemini/model/multimodal_encoder/clip_encoder.py", line 33, in load_model
    self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name)
  File "home_path/anaconda3/envs/minigeimini/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3144, in from_pretrained
    raise EnvironmentError(
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory model_zoo/OpenAI/clip-vit-large-patch14-336.

Some internet friends share their solutions: Downloading the corresponding "xxx.index.json" file, but I can't find any "xxx.index.json" file of "clip-vit-large-patch14-336" on the huggingface website.

I think maybe the relative path leads to the probelm, so I replace it with absolute path, but the same problem is shot.

My environment:
OS: ubuntu 22.04 64 bit
python: 3.10.14
others: other libraries are installed according to MiniGemini's official installation guide.

Does anybody have a solution for that, I will be very grateful for you, thank u.

RuntimeError: shape '[3, 2, 336, 2, 336]' is invalid for input of size 2709504

command used:
python -m minigemini.serve.cli
--model-path work_dirs/Mini-Gemini/Mini-Gemini-13B-HD
--image-file image_1.png,image_2.png

AttributeError: 'MiniGeminiLlamaModel' object has no attribute 'vlm_uni_query_projector'

when i was running minigemini.serve.cli to inference an image（minigemini_34b_hd），i got this error：AttributeError: 'MiniGeminiLlamaModel' object has no attribute 'vlm_uni_query_projector'.
Should i put all the following pretrainded models to the specific directory？or just need some of them.

Looking forward to your reply
Thanks

Could you offer LAION GPT4V images

follew https://huggingface.co/datasets/laion/gpt4v-dataset. There are about 400-500 pictures that cannot be downloaded. Could you provide the complete version?

How to get 13K generation-related instructions dataset?

Dear author:

Thanks for your interesting work.

But I am still confused about following questions:

How to get 13K generation-related instructions dataset?
How to only tuning LLM for generation (or tuning LLM for understanding and generation separately)

Looking forward to your reply~
Thanks!!

issues running cli inference example

command: python -m minigemini.serve.cli --model-path Mini-Gemini-34B-HD --image-file test_image.png

  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/paperspace/MiniGemini/minigemini/serve/cli.py", line 241, in <module>
    main(args)
  File "/home/paperspace/MiniGemini/minigemini/serve/cli.py", line 200, in main
    output_ids = model.generate(
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/paperspace/MiniGemini/minigemini/model/language_model/mini_gemini_llama.py", line 183, in generate
    return super().generate(
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1575, in generate
    result = self._sample(
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2697, in _sample
    outputs = self(
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
TypeError: MiniGeminiLlamaForCausalLM.forward() got an unexpected keyword argument 'cache_position'```

Continue FT from stage 2 with custom data

Hi, was wondering whether the stage2 script would be applicable for further FT from stage 2 with a small custom dataset for domain transfer? Or do we have to rewrite a separate script to do this?

Thanks and appreciate any help given!

Regards,

Adriel

Training model got some error

TypeError: pad_sequence(): argument 'padding_value' (position 3) must be float, not NoneType

Am using Qwen as LLM, got above error, what could be the reason? I have checked:

if tokenizer.pad_token_id == None:
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.unk_token
        tokenizer.pad_token_id = tokenizer.encode(tokenizer.pad_token)

didnt make it work, any help could be possible?

Training on customized LLM loss got sort of high

Hi, am using Qwen2 4b as LLM training the model, the template solving way I previous tried on LLaVa works OK, and no warnings appeared.

But when pretrain on minigemini, got unexpected loss result:

{'loss': 1.9597, 'learning_rate': 0.0007436649460805199, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8565, 'learning_rate': 0.0007435934693368483, 'epoch': 0.36}                                                                                                                                                            
{'loss': 2.0595, 'learning_rate': 0.0007435219860653267, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.9304, 'learning_rate': 0.0007434504962678705, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.9705, 'learning_rate': 0.0007433789999463957, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.9536, 'learning_rate': 0.000743307497102818, 'epoch': 0.36}                                                                                                                                                             
{'loss': 1.9405, 'learning_rate': 0.0007432359877390538, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8501, 'learning_rate': 0.0007431644718570192, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8722, 'learning_rate': 0.000743092949458631, 'epoch': 0.36}                                                                                                                                                             
{'loss': 1.6984, 'learning_rate': 0.0007430214205458056, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8309, 'learning_rate': 0.0007429498851204598, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8681, 'learning_rate': 0.0007428783431845109, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.9383, 'learning_rate': 0.0007428067947398757, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8059, 'learning_rate': 0.000742735239788472, 'epoch': 0.36}                                                                                                                                                             
{'loss': 1.8016, 'learning_rate': 0.0007426636783322172, 'epoch': 0.36}                                                                                                                                                            
{'loss': 2.0427, 'learning_rate': 0.0007425921103730288, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.899, 'learning_rate': 0.0007425205359128248, 'epoch': 0.36}

Looks like stucked at ~1.8, no longer decrease, what could be the reason for this? (the data all same)

(I tried using both chat template in pretrain and finetune, same as mini does)

TypeError: MiniGeminiMixtralForCausalLM.forward() got an unexpected keyword argument 'output_router_logits'

Hi, I'm trying to use MiniGemini outside of the demo environment, but am running into the following error when calling model.generate():

  File "/app/backend/minigemini.py", line 104, in chat_with_images
    output_ids = self.model.generate(
                 ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/MiniGemini/minigemini/model/language_model/mini_gemini_mixtral.py", line 142, in generate
    return super().generate(
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/generation/utils.py", line 1527, in generate
    result = self._greedy_search(
             ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/generation/utils.py", line 2411, in _greedy_search
    outputs = self(
              ^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: MiniGeminiMixtralForCausalLM.forward() got an unexpected keyword argument 'output_router_logits'

I have MiniGemini-2B working, but Mixtral is still giving me some trouble. The only relevant reference I can find is: https://www.opensourceagenda.com/projects/transformers/versions, which mentions output_router_logits was removed in transformers 4.39.0.

I see a similar error with Mini-Gemini-7B:

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: MiniGeminiLlamaForCausalLM.forward() got an unexpected keyword argument 'cache_position'

I'm using:
transformers 4.39.3
accelerate-0.29.1
torch 2.2.2
torchvision 0.17.2

I confess, the dependencies for MiniGemini are a real challenge to integrate, but I hope it can get resolved.

Similarities to LLaVA-HR

Congratulations on your great work and solid performance! However, we notice that your core idea and design are highly similar to our previsiou work LLaVA-HR, especially the dual visual pathways for MLLMs. We would like to see simple clarifications and discussions with our work in your paper. Thank you!

运行代码报错AttributeError: 'list' object has no attribute 'to'， image_aux_features_raw = self.get_model().get_vision_tower_aux()(images_aux).to(dtype=image_features.dtype, device=image_features.device)

Traceback (most recent call last):
File "/checkpoint/binary/train_package/minigemini/train/train_mem.py", line 14, in
train(attn_implementation="flash_attention_2")
File "/checkpoint/binary/train_package/minigemini/train/train.py", line 1262, in train
trainer.train()
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1961, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2902, in training_step
loss = self.compute_loss(model, inputs)
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2925, in compute_loss
outputs = model(**inputs)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1833, in forward
loss = self.module(*inputs, **kwargs)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/checkpoint/binary/train_package/minigemini/model/language_model/mini_gemini_gemma.py", line 87, in forward
) = self.prepare_inputs_labels_for_multimodal(
File "/checkpoint/binary/train_package/minigemini/model/mini_gemini_arch.py", line 328, in prepare_inputs_labels_for_multimodal
image_features = self.encode_images(images, images_aux)
File "/checkpoint/binary/train_package/minigemini/model/mini_gemini_arch.py", line 255, in encode_images
image_aux_features_raw = self.get_model().get_vision_tower_aux()(images_aux).to(
AttributeError: 'list' object has no attribute 'to'

4 bit loading fails

I have tried both the model worker and the cli, and both when passed 4bit loading just fail with the error message:

Loading pretrained weights (convnext_large_d_320).
Traceback (most recent call last):
  File "/usr/lib64/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/xxx/minigemini/serve/cli.py", line 234, in <module>
    main(args)
  File "/xxx/minigemini/serve/cli.py", line 56, in main
    tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name, args.load_8bit, args.load_4bit, device=args.device)
  File "/xxx/minigemini/model/builder.py", line 124, in load_pretrained_model
    model.get_model().initialize_uni_modules(model.config, for_eval=True)
  File "/xxx/minigemini/model/mini_gemini_arch.py", line 213, in initialize_uni_modules
    get_w(projector_weights, 'vision_tower.vision_tower', self.vision_tower, 'vision_tower')
  File "/xxx/minigemini/model/mini_gemini_arch.py", line 209, in get_w
    getattr(main_module, sub_module).to(device=device_type, dtype=weight_type)
  File "/xxx/venv/lib64/python3.10/site-packages/transformers/modeling_utils.py", line 2460, in to
    return super().to(*args, **kwargs)
  File "/xxx/venv/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in to
    raise TypeError('nn.Module.to only accepts floating point or complex '
TypeError: nn.Module.to only accepts floating point or complex dtypes, but got desired dtype=torch.uint8

loading with 8bit works, but OOMs on my hardware. (24+24 vram)

Demo inference wrong?

Don't see any positive information on a simple task?

Limitations on current image generating method

Hi, I found currently the image generating way hard to make input image as reference:

any thought about it?

BTW, I found the pretrain stage of this loss is very high:

{'loss': 2.573, 'learning_rate': 0.0007701925673852566, 'epoch': 0.34}                                                                                                                                                           
{'loss': 2.8217, 'learning_rate': 0.0007698799612970509, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.5646, 'learning_rate': 0.0007695672062744539, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.6857, 'learning_rate': 0.0007692543024900611, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.6944, 'learning_rate': 0.0007689412501165496, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.6418, 'learning_rate': 0.0007686280493266786, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.6801, 'learning_rate': 0.0007683147002932893, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.7245, 'learning_rate': 0.0007680012031893049, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.6275, 'learning_rate': 0.0007676875581877296, 'epoch': 0.34}

3 gradio issues

There's quite a few gradio issues:

function_markdown is undefined
unexpected keyword concurrency_limit
recursive json encoder

First two were "fixed" by just commenting them out, however the third issue prevent gradio working at all as it immediately crashes the instance with that error.

/xxx//minigemini/serve/gradio_web_server.py:351: UserWarning: `layout` parameter is deprecated, and it has no effect
  chatbot = gr.Chatbot(
Traceback (most recent call last):
  File "/usr/lib64/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/xxx/minigemini/serve/gradio_web_server.py", line 472, in <module>
    demo = build_demo(args.embed, concurrency_count=args.concurrency_count)
  File "/xxx/minigemini/serve/gradio_web_server.py", line 371, in build_demo
    gr.Markdown(function_markdown)

/xxx//minigemini/serve/gradio_web_server.py:351: UserWarning: `layout` parameter is deprecated, and it has no effect
  chatbot = gr.Chatbot(
Traceback (most recent call last):
  File "/usr/lib64/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/xxx//minigemini/serve/gradio_web_server.py", line 472, in <module>
    demo = build_demo(args.embed, concurrency_count=args.concurrency_count)
  File "/xxx//minigemini/serve/gradio_web_server.py", line 394, in build_demo
    regenerate_btn.click(
TypeError: EventListenerMethod.__call__() got an unexpected keyword argument 'concurrency_limit'

File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 287, in jsonable_encoder
  encoded_value = jsonable_encoder(
File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 331, in jsonable_encoder
  return jsonable_encoder(
File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 287, in jsonable_encoder
  encoded_value = jsonable_encoder(
File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 331, in jsonable_encoder
  return jsonable_encoder(
File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 287, in jsonable_encoder
  encoded_value = jsonable_encoder(
File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 318, in jsonable_encoder
  if isinstance(obj, classes_tuple):
File "/usr/lib64/python3.10/abc.py", line 119, in __instancecheck__
  return _abc_instancecheck(cls, instance)
RecursionError: maximum recursion depth exceeded in comparison

any idea coping with long video input?

Dear author:
Thanks for publishing the mini gemini paper. As Gemini 1.5 support up to one hour long video input for 1fps sampling, I wonder how to adopt your framework to support long video training and inference? thank you.

No image generated while run python -m minigemini.serve.cli --gen

Dear author:

Thanks for your interesting work.

When I run the following command with the input What is unusual about this image?, no image generated in the output:

python -m minigemini.serve.cli \
    --model-path work_dirs/Mini-Gemini/Mini-Gemini-2B \
    --image-file examples/extreme_ironing.jpg \
    --gen

I wonder if the Mini-Gemini-2B model does not have the ability to generate images?

And if fine-tuning is needed, which datasets should be used to make the model output <h> ... </h>?

Thanks!!

CLI of 2B does not work

How to reproduce:

Install this repo as README.md
run the following commands

export HF_ENDPOINT=https://hf-mirror.com

python -m minigemini.serve.cli \
    --model-path ./Mini-Gemini/Mini-Gemini-2B \
    --image-file ./images/demo_gen.png  \
    --debug \

Results:

The model does not generate anything since the stop_str is set to be EMPTY STRING ''. After I fixed this, I also found the chat history is not preserved in the prompt which results the multi-turn conversation results to be unexpected.

To precisely:

the loss are lower, normally I got 0.9-1.0 , but using CLIP with input size 378, the loss can to 0.7-0.8, but the inference result are very poor;
The CLIP model I used was Apple's DNFS_vit_G_378 model.
I have changed the convnext input resuoltion accordingly.

Any reason for this? This is really weired, better and larger ViT got bad results.

dvlab-research / mgm Goto Github PK

mgm's Introduction

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

Release

Contents

Demo

Install

Model

Preparation

Dataset

Pretrained Weights

Structure

Train

Evaluation

CLI Inference

Gradio Web UI

Launch a controller

Launch a gradio web server.

Launch a model worker

Launch a model worker (Multiple GPUs, when GPU VRAM <= 24GB)

Launch a model worker (4-bit, 8-bit inference, quantized)

Examples

Hi-Resolution Understanding

Generation with Reasoning

Citation

Acknowledgement

License

mgm's People

Contributors

Stargazers

Watchers

Forkers

mgm's Issues

Recommend Projects

Recommend Topics

Recommend Org