
visual-chinese-llama-alpaca's Introduction

🇨🇳中文 | 🌐English




Visual-Chinese-LLaMA-Alpaca (VisualCLA) is a multimodal Chinese large language model developed on top of the Chinese-LLaMA & Alpaca project. VisualCLA adds an image encoding module to the Chinese LLaMA/Alpaca models so that the LLaMA model can accept visual input. On this basis, it is pre-trained on Chinese image-text pairs to align image and text representations and give the model basic multimodal understanding; it is then fine-tuned on a multimodal instruction dataset to strengthen its ability to understand, follow, and converse about multimodal instructions.

This project is still under development. The currently released version is a preview for testing, and model quality is still being improved.

Main contents of this project:

  • 🚀 VisualCLA, a multimodal model based on Chinese-LLaMA-Alpaca, with multimodal instruction understanding and dialogue capabilities
  • 🚀 Inference code and deployment scripts based on Gradio/Text-Generation-WebUI
  • 🚀 Demonstrations of the model on multimodal instruction-understanding tasks, with translated test sets released
  • 🚀 Current open-source version: VisualCLA-7B-v0.1 (test version)

Demo Examples




Chinese LLaMA-2 & Alpaca-2 | Chinese LLaMA & Alpaca | Multimodal VLE | Chinese MiniRBT | Chinese LERT | Chinese-English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | Knowledge distillation toolkit TextBrewer | Model pruning toolkit TextPruner

News

[2023/07/18] The demo now supports webcams, so photos can be taken directly from the camera


Model Introduction

Visual-Chinese-LLaMA-Alpaca (VisualCLA) is a Chinese multimodal model that accepts both image and text input. VisualCLA adds an image encoding module on top of the Chinese Alpaca model, enabling it to understand visual information.



VisualCLA consists of three components: a Vision Encoder, a Resampler, and an LLM:

  • Vision Encoder: a ViT that encodes the input image into a sequence of image representations. The released VisualCLA model uses CLIP-ViT-L/14 as the architecture and initialization of the image encoder.
  • Resampler: a 6-layer BERT-like module, similar in structure and function to the Perceiver Resampler in Flamingo or the Q-Former in BLIP-2. It resamples the image representations with a set of trainable query vectors to reduce their length, and then aligns the image representations to the LLM's hidden dimension through a linear layer. The parameters of this component are trained from scratch.
  • LLM: a LLaMA model, initialized from Chinese-Alpaca-Plus 7B.

The image is encoded by the Vision Encoder and mapped by the Resampler into a fixed-length representation. The image and text representations are then concatenated and fed into the LLM, which generates its output conditioned on both the image and the text instruction.
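
The following minimal PyTorch sketch illustrates this resample-then-concatenate flow. All dimensions and module choices here are illustrative assumptions for exposition, not the project's actual implementation.

import torch
import torch.nn as nn

# Illustrative sketch only: a set of learnable queries cross-attends to the ViT patch
# features, and a linear layer aligns the result to the LLM hidden size.
vision_dim, llm_dim, num_queries, num_patches = 1024, 4096, 64, 256

class ToyResampler(nn.Module):
    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim))
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads=16, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)  # align to the LLM dimension

    def forward(self, image_feats):                        # (B, num_patches, vision_dim)
        q = self.queries.expand(image_feats.size(0), -1, -1)
        resampled, _ = self.cross_attn(q, image_feats, image_feats)
        return self.proj(resampled)                        # (B, num_queries, llm_dim)

image_feats = torch.randn(1, num_patches, vision_dim)      # stand-in for Vision Encoder output
text_embeds = torch.randn(1, 32, llm_dim)                  # stand-in for LLM token embeddings
image_tokens = ToyResampler()(image_feats)
llm_inputs = torch.cat([image_tokens, text_embeds], dim=1) # concatenated input to the LLM
print(llm_inputs.shape)  # torch.Size([1, 96, 4096])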

Training Strategy

As in Chinese-LLaMA-Alpaca, VisualCLA is fine-tuned efficiently with LoRA. The trainable parameters include the LoRA parameters of the image encoder, the LoRA parameters of the LLM, and all parameters of the Resampler (see the annotations in the model architecture figure). Training consists of two stages:

  • Multimodal pre-training: trained on Chinese image-text pairs; the model generates the corresponding text description (caption) for an image.
  • Multimodal instruction fine-tuning: starting from the model obtained in the previous stage, fine-tuned on a multimodal instruction dataset built from a variety of supervised task data, covering visual question answering, visual reasoning, open-domain question answering, OCR, and other task types. A portion of pure-text instruction data is mixed in to compensate for the scarcity of multimodal data and to mitigate forgetting of instruction-following ability. This stage uses the same instruction template as the Chinese-Alpaca models.

The training details of VisualCLA-7B-v0.1 are summarized in the table below:

| Training stage | Multimodal pre-training | Multimodal instruction fine-tuning |
| --- | --- | --- |
| Initialized from | Chinese-Alpaca-Plus 7B | Multimodal pre-trained model |
| Training task | Multimodal pre-training | Multimodal instruction fine-tuning |
| Task type | Image captioning | VQA, visual reasoning, open-domain QA, OCR, etc. |
| Prompt template | Alpaca prompt template | Alpaca prompt template |
| Training set size (samples) | 23M | 350K (multimodal instructions) + 1.3M (pure-text instructions) |
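
As a rough illustration (this is not the project's released training code), the trainable-parameter setup described above could be declared with PEFT roughly as follows; all hyperparameter values and module names are assumptions made for the sketch.

from peft import LoraConfig, TaskType

# Sketch only: LoRA adapters on the LLM's attention projections, with the embeddings and
# LM head trained in full via modules_to_save. The Resampler, trained from scratch, would
# simply be left fully trainable outside of PEFT.
llm_lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                      # assumed rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
)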

Model Download

LLaMA prohibits commercial use. To comply with the corresponding license, this project releases incremental (delta) weights, including:

  • LoRA, embedding and LM head weights for LLaMA
  • LoRA weights for CLIP-ViT
  • All weights of the Resampler

Users need to load or merge these weights on top of Chinese-Alpaca-Plus and CLIP-ViT to obtain a complete, usable VisualCLA model.

| Model | Base models required | Incremental weights download |
| --- | --- | --- |
| VisualCLA-7B-v0.1 | Chinese-Alpaca-Plus 7B (HF format)† + CLIP-ViT-L/14‡ | [Baidu Netdisk] [Google Drive] |

†: For how to obtain and merge the Chinese-Alpaca-Plus 7B model, please refer to the Chinese-LLaMA-Alpaca model merging and conversion guide

‡: Download link for the CLIP-ViT-L/14 model

Model Hub

You can also download the model from the 🤗Model Hub and use VisualCLA via transformers and PEFT. The model loading name below is the name to pass to .from_pretrained(). See Model Usage for usage examples.

| Model | Model name for loading | Link |
| --- | --- | --- |
| VisualCLA-7B-v0.1 | ziqingyang/visualcla-7b-v0.1 | Hub link |

The archive contains the following files:

visualcla-7b-v0.1/
  - adapter_config.json      # LoRA configuration file
  - adapter_model.bin        # LoRA weights
  - config.json              # VisualCLA configuration file
  - added_tokens.json        # tokenizer configuration file
  - special_tokens_map.json  # tokenizer configuration file
  - tokenizer_config.json    # tokenizer configuration file
  - tokenizer.model          # tokenizer file
  - preprocessor_config.json # ImageProcessor configuration file

Model Usage

Colab Notebook

Besides the step-by-step instructions below, we also provide a Colab notebook covering model installation, merging, inference and deployment, which users can run directly to try the model and inspect the results:

| Notebook | Content | Link | Notebook file |
| --- | --- | --- | --- |
| visualcla_inference.ipynb | Model installation, merging, command-line inference and Gradio demo deployment | Open In Colab | visualcla_inference.ipynb |

Installation

Clone this project to your local machine and install the model code into the Python search path:

git clone https://github.com/airaria/Visual-Chinese-LLaMA-Alpaca
cd Visual-Chinese-LLaMA-Alpaca
pip install -e .
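
After the editable install, a quick sanity check (assuming the install succeeded) is to make sure the package imports:

python -c "import visualcla"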

Merging the model (optional, recommended)

You can merge the incremental weights with the base models and save the result, which is more convenient to use and faster to load. The merged model is about 14 GB, and the merging process requires about 20 GB of memory; make sure your machine has enough disk space and RAM.

Run scripts/merge_llama_with_visualcla_lora.py in this project to perform the merge:

python scripts/merge_llama_with_visualcla_lora.py \
    --text_model /path/to/chinese/alpaca/plus/7b \
    --vision_model /path/to/clip/vit/14-L \
    --lora_model /path/to/visualcla/lora \
    --output_dir output_dir

Arguments:

  • --text_model: directory of the Chinese-Alpaca-Plus 7B model
  • --vision_model: directory of the CLIP-ViT-L/14 model
  • --lora_model: directory of the VisualCLA LoRA model
  • --output_dir: directory to save the merged model

The model directories passed in can also be replaced with model names on the 🤗Model Hub.
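
For instance, a possible invocation that uses the Hub names appearing elsewhere in this README (the output directory name here is arbitrary):

python scripts/merge_llama_with_visualcla_lora.py \
    --text_model /path/to/chinese/alpaca/plus/7b \
    --vision_model openai/clip-vit-large-patch14 \
    --lora_model ziqingyang/visualcla-7b-v0.1 \
    --output_dir visualcla-7b-merged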

After merging, the contents of output_dir are as follows:

output_dir/
 - text_encoder/             # LLM weights and configuration
 - image_encoder/            # Vision Encoder weights and configuration
 - pytorch_model.bin         # Resampler weights
 - config.json               # VisualCLA configuration file
 - added_tokens.json         # tokenizer configuration file
 - special_tokens_map.json   # tokenizer configuration file
 - tokenizer_config.json     # tokenizer configuration file
 - tokenizer.model           # tokenizer file
 - preprocessor_config.json  # ImageProcessor configuration file

The merged model can be loaded with visualcla.get_model_and_tokenizer_and_processor; see the next section for details.

Model Loading and Inference

Calling from Python

If the model has been merged

You can call VisualCLA from a Python program with the following code:

import torch
import visualcla
model, tokenizer, _ = visualcla.get_model_and_tokenizer_and_processor(
      visualcla_model="/path/to/the/merged/visualcla/model",
      torch_dtype=torch.float16,
      load_in_8bit=True
)
model.to(0)
history=[]
visualcla.chat(model=model, image="path/to/image/filename", text="your instruction here", history=history)
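
visualcla.chat returns the response together with the updated history (this is how the project's inference script uses it), so multi-turn conversation can reuse the same history. A minimal sketch:

response, history = visualcla.chat(model=model, image="path/to/image/filename",
                                   text="your instruction here", history=history)
response, history = visualcla.chat(model=model, image="path/to/image/filename",
                                   text="a follow-up question here", history=history)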

If the model has not been merged

You need to load Chinese-Alpaca-Plus-7B, CLIP-ViT-L/14 and the VisualCLA LoRA weights together:

import torch
import visualcla
from peft import PeftModel
base_model, tokenizer, _ = visualcla.get_model_and_tokenizer_and_processor(
      text_model="/path/to/chinese/alpaca/plus/7b",  # Path to the Chinese-Alpaca-Plus 7B model
      vision_model="openai/clip-vit-large-patch14",  # We can also use the Model Hub name of the model
      lora_model="/path/to/visualcla/lora",
      torch_dtype=torch.float16
)
base_model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(base_model, "/path/to/visualcla/lora", torch_dtype=torch.float16)
model.to(0)
history = []
visualcla.chat(model=model, image="path/to/image/filename", text="your instruction here", history=history)

Inference Script

A more fully featured Python inference script, inference.py, is provided under scripts/inference in this project.

If the model has been merged

python scripts/inference/inference.py \
    --visualcla_model visualcla_model \
    --image_file image_file \
    --load_in_8bit

If the model has not been merged

python scripts/inference/inference.py \
    --text_model /path/to/chinese/alpaca/plus/7b \
    --vision_model /path/to/clip/vit/14-L \
    --lora_model /path/to/visualcla/lora \
    --image_file image_file
    # 8-bit loading is not yet supported for unmerged models

Arguments:

  • --text_model: directory of the merged Chinese-Alpaca-Plus 7B model, or its model name on the 🤗Model Hub
  • --vision_model: directory of the CLIP-ViT-L/14 model, or its model name on the 🤗Model Hub
  • --lora_model: directory of the VisualCLA LoRA model, or its model name on the 🤗Model Hub
  • --visualcla_model: the VisualCLA model produced by the merging script
    • If this argument is not provided, text_model, vision_model and lora_model are merged on the fly and used for inference
    • If this argument is provided, it takes precedence, and text_model, vision_model and lora_model are no longer needed
  • --image_file (optional): the image file to load, in standard formats such as png and jpg. If omitted, the model replies based on the text only (see the example after this list)
  • --load_in_8bit (optional): whether to use 8-bit inference for the LLM part
  • --gpus (optional): GPU device id(s) to use, default 0
  • --only_cpu (optional): whether to run inference on CPU only
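
Since --image_file is optional, a text-only run with a merged model could look like the following sketch (the placeholder visualcla_model stands for your merged model directory, as above):

python scripts/inference/inference.py \
    --visualcla_model visualcla_model \
    --load_in_8bit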

Model Deployment

Gradio-based web demo

First install the dependencies:

pip install gradio mdtex2html

Launch:

python scripts/inference/gradio_demo.py --visualcla_model visualcla_model --load_in_8bit

Arguments:

  • --visualcla_model: the VisualCLA model produced by the merging script
  • --share (optional): whether to create a publicly accessible link (see the example after this list)
  • --load_in_8bit (optional): whether to use 8-bit inference for the LLM part
  • --gpus (optional): GPU device id(s) to use, default 0
  • --only_cpu (optional): whether to run inference on CPU only
  • --no_stream (optional): disable streaming output
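
For example, a launch that serves the demo on GPU 0 with 8-bit inference and a public share link might look like this (the flag combination is shown only as an illustration):

python scripts/inference/gradio_demo.py \
    --visualcla_model visualcla_model \
    --load_in_8bit \
    --share \
    --gpus 0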

Deployment with Text-Generation-webUI

Compared with the gradio_demo.py-based deployment, Text-Generation-webUI supports using multiple images across a multi-turn conversation. For detailed steps of deploying the model with Text-Generation-webUI, please refer to here.

Results

All examples below show the performance of the v0.1 test version

Chinese Test Sets

We translated the LLaVA and OwlEval test sets into Chinese. For the dataset downloads and the model's results on these two sets, see here.

Limitations

Although the model in this project has a certain ability to understand and generate content based on images, it also has limitations, including but not limited to:

  • Hallucinations: it may generate content that is inconsistent with or irrelevant to the image, such as describing objects that do not exist in the picture
  • Pre-training is still insufficient: it may misunderstand instructions or fail to answer in a way that properly incorporates the image
  • Low accuracy in recognizing and understanding fine-grained text, formulas, tables, etc. in images
  • Output quality degrades after multiple rounds of dialogue
  • No interactive online demo (note: users can still deploy it locally themselves)

Citation

If you find this project helpful for your research, or use its code or data, please consider citing our work:

@article{chinese-llama-alpaca,
      title={Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca}, 
      author={Cui, Yiming and Yang, Ziqing and Yao, Xin},
      journal={arXiv preprint arXiv:2304.08177},
      url={https://arxiv.org/abs/2304.08177},
      year={2023}
}

@misc{visualcla,
  author = {Yang, Ziqing and Pan, Yuchen and Cui, Yiming},
  title = {Visual-Chinese-LLaMA-Alpaca},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/airaria/Visual-Chinese-LLaMA-Alpaca/}},
}

Acknowledgments

This project is built upon the following open-source projects; we thank the related projects and their research and development staff.

Disclaimer

The resources related to this project are for academic research only and are strictly prohibited from commercial use. When using parts that involve third-party code, please strictly follow the corresponding open-source licenses. Content generated by the model is affected by factors such as model computation, randomness, and quantization precision loss; this project makes no guarantee of its accuracy. This project assumes no legal liability for any content output by the model, nor for any losses that may arise from the use of the related resources and output results.

This project is initiated and maintained by individuals and collaborators in their spare time, so we cannot guarantee timely responses to or resolution of issues.

visual-chinese-llama-alpaca's People

Contributors

airaria, gogojoestar


visual-chinese-llama-alpaca's Issues

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

$ python scripts/inference/inference.py --visualcla_model visualcla --image_file pics/examples/food.jpg --load_in_8bit
[INFO|tokenization_utils_base.py:1837] 2023-07-24 16:05:27,669 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:1837] 2023-07-24 16:05:27,669 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1837] 2023-07-24 16:05:27,669 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1837] 2023-07-24 16:05:27,669 >> loading file tokenizer_config.json
[WARNING|logging.py:295] 2023-07-24 16:05:27,670 >> You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at huggingface/transformers#24565
[INFO|tokenization_utils.py:426] 2023-07-24 16:05:27,697 >> Adding to the vocabulary
[INFO|tokenization_utils.py:426] 2023-07-24 16:05:27,697 >> Adding to the vocabulary
[INFO|tokenization_utils.py:426] 2023-07-24 16:05:27,697 >> Adding to the vocabulary
[INFO|tokenization_utils.py:426] 2023-07-24 16:05:27,697 >> Adding <img_token> to the vocabulary
2023-07-24 16:05:27,698 - INFO - visualcla.modeling_utils - Init VisualCLA model from pretrained
[INFO|configuration_utils.py:710] 2023-07-24 16:05:27,698 >> loading configuration file visualcla/config.json
[INFO|configuration_utils.py:768] 2023-07-24 16:05:27,699 >> Model config VisualCLAConfig {
"image_size": 224,
"initializer_range": 0.02,
"layer_norm_eps": 1e-12,
"model_type": "visualcla",
"text_config": {
"_name_or_path": "",
"add_cross_attention": false,
"architectures": [
"LlamaForCausalLM"
],
"bad_words_ids": null,
"begin_suppress_tokens": null,
"bos_token_id": 1,
"chunk_size_feed_forward": 0,
"cross_attention_hidden_size": null,
"decoder_start_token_id": null,
"diversity_penalty": 0.0,
"do_sample": false,
"early_stopping": false,
"encoder_no_repeat_ngram_size": 0,
"eos_token_id": 2,
"exponential_decay_length_penalty": null,
"finetuning_task": null,
"forced_bos_token_id": null,
"forced_eos_token_id": null,
"hidden_act": "silu",
"hidden_size": 4096,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 11008,
"is_decoder": false,
"is_encoder_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 2048,
"min_length": 0,
"model_type": "llama",
"no_repeat_ngram_size": 0,
"num_attention_heads": 32,
"num_beam_groups": 1,
"num_beams": 1,
"num_hidden_layers": 32,
"num_key_value_heads": 32,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_scores": false,
"pad_token_id": 0,
"prefix": null,
"pretraining_tp": 1,
"problem_type": null,
"pruned_heads": {},
"remove_invalid_values": false,
"repetition_penalty": 1.0,
"return_dict": true,
"return_dict_in_generate": false,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"sep_token_id": null,
"suppress_tokens": null,
"task_specific_params": null,
"temperature": 1.0,
"tf_legacy_loss": false,
"tie_encoder_decoder": false,
"tie_word_embeddings": false,
"tokenizer_class": null,
"top_k": 50,
"top_p": 1.0,
"torch_dtype": "float16",
"torchscript": false,
"transformers_version": "4.31.0",
"typical_p": 1.0,
"use_bfloat16": false,
"use_cache": true,
"vocab_size": 49954
},
"tie_word_embeddings": false,
"transformers_version": "4.31.0",
"use_visual_resampler": true,
"vision_config": {
"_name_or_path": "",
"add_cross_attention": false,
"architectures": null,
"attention_dropout": 0.0,
"bad_words_ids": null,
"begin_suppress_tokens": null,
"bos_token_id": null,
"chunk_size_feed_forward": 0,
"cross_attention_hidden_size": null,
"decoder_start_token_id": null,
"diversity_penalty": 0.0,
"do_sample": false,
"dropout": 0.0,
"early_stopping": false,
"encoder_no_repeat_ngram_size": 0,
"eos_token_id": null,
"exponential_decay_length_penalty": null,
"finetuning_task": null,
"forced_bos_token_id": null,
"forced_eos_token_id": null,
"hidden_act": "quick_gelu",
"hidden_size": 1024,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"image_size": 224,
"initializer_factor": 1.0,
"initializer_range": 0.02,
"intermediate_size": 4096,
"is_decoder": false,
"is_encoder_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-05,
"length_penalty": 1.0,
"max_length": 20,
"min_length": 0,
"model_type": "clip_vision_model",
"no_repeat_ngram_size": 0,
"num_attention_heads": 16,
"num_beam_groups": 1,
"num_beams": 1,
"num_channels": 3,
"num_hidden_layers": 24,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_scores": false,
"pad_token_id": null,
"patch_size": 14,
"prefix": null,
"problem_type": null,
"projection_dim": 768,
"pruned_heads": {},
"remove_invalid_values": false,
"repetition_penalty": 1.0,
"return_dict": true,
"return_dict_in_generate": false,
"sep_token_id": null,
"suppress_tokens": null,
"task_specific_params": null,
"temperature": 1.0,
"tf_legacy_loss": false,
"tie_encoder_decoder": false,
"tie_word_embeddings": true,
"tokenizer_class": null,
"top_k": 50,
"top_p": 1.0,
"torch_dtype": null,
"torchscript": false,
"transformers_version": "4.31.0",
"typical_p": 1.0,
"use_bfloat16": false
},
"visual_resampler_config": {
"hidden_size": 1024,
"intermediate_size": 4096,
"num_attention_heads": 16,
"num_hidden_layers": 6,
"num_query_tokens": 64
},
"vocab_size": 49958
}

[INFO|configuration_utils.py:710] 2023-07-24 16:05:27,844 >> loading configuration file visualcla/text_encoder/config.json
[INFO|configuration_utils.py:768] 2023-07-24 16:05:27,845 >> Model config LlamaConfig {
"_name_or_path": "chinese-alpaca-plus-7b",
"architectures": [
"LlamaForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 2048,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 32,
"pad_token_id": 0,
"pretraining_tp": 1,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.31.0",
"use_cache": true,
"vocab_size": 49954
}

[INFO|modeling_utils.py:2600] 2023-07-24 16:05:27,845 >> loading weights file visualcla/text_encoder/pytorch_model.bin.index.json
[INFO|modeling_utils.py:1172] 2023-07-24 16:05:27,845 >> Instantiating LlamaForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:599] 2023-07-24 16:05:27,846 >> Generate config GenerationConfig {
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 0,
"transformers_version": "4.31.0"
}

[INFO|modeling_utils.py:2715] 2023-07-24 16:05:28,053 >> Detected 8-bit loading: activating 8-bit loading for this model
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00, 4.24s/it]
[INFO|modeling_utils.py:3329] 2023-07-24 16:05:44,119 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:3337] 2023-07-24 16:05:44,119 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at visualcla/text_encoder.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:559] 2023-07-24 16:05:44,122 >> loading configuration file visualcla/text_encoder/generation_config.json
[INFO|configuration_utils.py:599] 2023-07-24 16:05:44,122 >> Generate config GenerationConfig {
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 0,
"transformers_version": "4.31.0"
}

[INFO|configuration_utils.py:710] 2023-07-24 16:05:44,188 >> loading configuration file visualcla/vision_encoder/config.json
[INFO|configuration_utils.py:768] 2023-07-24 16:05:44,188 >> Model config CLIPVisionConfig {
"_name_or_path": "clip-vit-large-patch14",
"architectures": [
"CLIPVisionModel"
],
"attention_dropout": 0.0,
"dropout": 0.0,
"hidden_act": "quick_gelu",
"hidden_size": 1024,
"image_size": 224,
"initializer_factor": 1.0,
"initializer_range": 0.02,
"intermediate_size": 4096,
"layer_norm_eps": 1e-05,
"model_type": "clip_vision_model",
"num_attention_heads": 16,
"num_channels": 3,
"num_hidden_layers": 24,
"patch_size": 14,
"projection_dim": 768,
"torch_dtype": "float16",
"transformers_version": "4.31.0"
}

[INFO|modeling_utils.py:2600] 2023-07-24 16:05:44,188 >> loading weights file visualcla/vision_encoder/pytorch_model.bin
[INFO|modeling_utils.py:1172] 2023-07-24 16:05:44,483 >> Instantiating CLIPVisionModel model under default dtype torch.float16.
[INFO|modeling_utils.py:3329] 2023-07-24 16:05:45,066 >> All model checkpoint weights were used when initializing CLIPVisionModel.

[INFO|modeling_utils.py:3337] 2023-07-24 16:05:45,066 >> All the weights of CLIPVisionModel were initialized from the model checkpoint at visualcla/vision_encoder.
If your task is similar to the task the model of the checkpoint was trained on, you can already use CLIPVisionModel for predictions without further training.
[INFO|image_processing_utils.py:337] 2023-07-24 16:05:46,059 >> loading configuration file visualcla/preprocessor_config.json
[INFO|image_processing_utils.py:389] 2023-07-24 16:05:46,059 >> Image processor CLIPImageProcessor {
"crop_size": {
"height": 224,
"width": 224
},
"do_center_crop": true,
"do_convert_rgb": true,
"do_normalize": true,
"do_rescale": true,
"do_resize": true,
"feature_extractor_type": "CLIPFeatureExtractor",
"image_mean": [
0.48145466,
0.4578275,
0.40821073
],
"image_processor_type": "CLIPImageProcessor",
"image_std": [
0.26862954,
0.26130258,
0.27577711
],
"resample": 3,
"rescale_factor": 0.00392156862745098,
"size": {
"shortest_edge": 224
}
}

2023-07-24 16:05:46,062 - INFO - main - *** Start Inference ***

========== Usage ==========

Start Inference with instruction mode.
You can enter instruction or special control commands after '>'. Below are the usage of the control commands

change image:[image_path] load the image from [image_path]
clear Clear chat history. This command will not change the image.
exit Exit Inference

Image: pics/examples/food.jpg

图片中有哪些食物
Traceback (most recent call last):
File "/home/yibo/Visual-Chinese-LLaMA-Alpaca/scripts/inference/inference.py", line 119, in
main()
File "/home/yibo/Visual-Chinese-LLaMA-Alpaca/scripts/inference/inference.py", line 110, in main
response, history = visualcla.chat(model, image=image_path, text=text, history=history)
File "/home/yibo/Visual-Chinese-LLaMA-Alpaca/env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/yibo/Visual-Chinese-LLaMA-Alpaca/models/visualcla/modeling_utils.py", line 167, in chat
outputs = model.generate(
File "/home/yibo/Visual-Chinese-LLaMA-Alpaca/env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/yibo/Visual-Chinese-LLaMA-Alpaca/models/visualcla/modeling_visualcla.py", line 382, in generate
outputs = self.text_model.generate(
File "/home/yibo/Visual-Chinese-LLaMA-Alpaca/env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/yibo/Visual-Chinese-LLaMA-Alpaca/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 1588, in generate
return self.sample(
File "/home/yibo/Visual-Chinese-LLaMA-Alpaca/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 2678, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf, nan or element < 0

Several issues with text_generation_webui deployment

1. FileNotFoundError: [Errno 2] No such file or directory: './models/visualcla_merged-7b/pytorch_model.bin'
For the merged-weights case, I needed to run
cp visualcla/pytorch_model.bin models/visualcla_merged-7b/
Not sure whether this is correct.

2. OSError: Can't load the configuration of './models/visualcla_merged-7b/vision_encoder'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure './models/visualcla_merged-7b/vision_encoder' is the correct path to a directory containing a config.json file
For the merged-weights case, I needed to run
cp -r ./visualcla/vision_encoder/ ./models/visualcla_merged-7b/
Not sure whether this is correct.

3. OSError: ./models/visualcla_merged-7b does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co/./models/visualcla_merged-7b/main' for available files.
For the merged-weights case, I ran
cp ./visualcla/preprocessor_config.json models/visualcla_merged-7b/
Not sure whether this is correct.

4. KeyError: 'visual_resampler_config'
After the operations above, I reran server.py:
$ python server.py --model=visualcla_merged-7b --multimodal-pipeline=visualcla-7b --chat --settings=settings-visualcla.yaml --share --load-in-8bit
2023-07-27 09:31:45 WARNING:The gradio "share link" feature uses a proprietary executable to create a reverse tunnel. Use it with care.
2023-07-27 09:31:47 INFO:Loading settings from settings-visualcla.yaml...
2023-07-27 09:31:47 INFO:Loading visualcla_merged-7b...
2023-07-27 09:38:36 WARNING:models/visualcla_merged-7b/special_tokens_map.json is different from the original LlamaTokenizer file. It is either customized or outdated.
2023-07-27 09:38:36 INFO:Loaded the model in 408.25 seconds.

2023-07-27 09:38:36 INFO:Loading the extension "multimodal"...
2023-07-27 09:38:36 INFO:VisualCLA - Loading CLIP from ./models/visualcla_merged-7b/vision_encoder as torch.float32 on cuda:0...
2023-07-27 09:38:38 INFO:VisualCLA - Loading visual resampler from ./models/visualcla_merged-7b/ as torch.float32 on cuda:0...
Traceback (most recent call last):
File "/home/yibo/text-generation-webui-Visual-Chinese-LLaMA-Alpaca/server.py", line 1179, in
create_interface()
File "/home/yibo/text-generation-webui-Visual-Chinese-LLaMA-Alpaca/server.py", line 1086, in create_interface
extensions_module.create_extensions_block()
File "/home/yibo/text-generation-webui-Visual-Chinese-LLaMA-Alpaca/modules/extensions.py", line 175, in create_extensions_block
extension.ui()
File "/home/yibo/text-generation-webui-Visual-Chinese-LLaMA-Alpaca/extensions/multimodal/script.py", line 119, in ui
multimodal_embedder = MultimodalEmbedder(params)
File "/home/yibo/text-generation-webui-Visual-Chinese-LLaMA-Alpaca/extensions/multimodal/multimodal_embedder.py", line 27, in init
pipeline, source = load_pipeline(params)
File "/home/yibo/text-generation-webui-Visual-Chinese-LLaMA-Alpaca/extensions/multimodal/pipeline_loader.py", line 30, in load_pipeline
pipeline = getattr(pipeline_modules[k], 'get_pipeline')(shared.args.multimodal_pipeline, params)
File "/home/yibo/text-generation-webui-Visual-Chinese-LLaMA-Alpaca/extensions/multimodal/pipelines/visualcla/pipelines.py", line 11, in get_pipeline
return VisualCLA_7B_Pipeline(params)
File "/home/yibo/text-generation-webui-Visual-Chinese-LLaMA-Alpaca/extensions/multimodal/pipelines/visualcla/visualcla.py", line 140, in init
super().init(params)
File "/home/yibo/text-generation-webui-Visual-Chinese-LLaMA-Alpaca/extensions/multimodal/pipelines/visualcla/visualcla.py", line 30, in init
self.image_processor, self.vision_tower, self.visual_resampler, self.image_projection_layer = self._load_models()
File "/home/yibo/text-generation-webui-Visual-Chinese-LLaMA-Alpaca/extensions/multimodal/pipelines/visualcla/visualcla.py", line 47, in _load_models
visual_resampler_config = VisualResamplerConfig.from_dict(json.load(open(os.path.join(shared.settings['visualcla_merged_model'], 'config.json')))['visual_resampler_config'])
KeyError: 'visual_resampler_config'

The config.json file is as follows:
more models/visualcla_merged-7b/config.json
{
"_name_or_path": "chinese-alpaca-plus-7b/",
"architectures": [
"LlamaForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 2048,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"pad_token_id": 0,
"rms_norm_eps": 1e-06,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.30.2",
"use_cache": true,
"vocab_size": 49954
}

Please help take a look, thanks.

Training / fine-tuning code

Hi, I would like to do some SFT training on this model. Could you open-source the pre-training and fine-tuning code?

Error with CPU-only inference

Following visualcla_inference.ipynb on Colab, when I ran the final web-demo step I used --only_cpu, but it still complains about CUDA.

!python Visual-Chinese-LLaMA-Alpaca/scripts/inference/gradio_demo.py --only_cpu --visualcla_model visualcla --load_in_8bit --share
This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
Both `max_new_tokens` (=512) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Traceback (most recent call last):
  File "/content/Visual-Chinese-LLaMA-Alpaca/models/visualcla/modeling_utils.py", line 439, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/content/Visual-Chinese-LLaMA-Alpaca/models/visualcla/modeling_utils.py", line 222, in generate_with_callback
    model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/Visual-Chinese-LLaMA-Alpaca/models/visualcla/modeling_visualcla.py", line 382, in generate
    outputs = self.text_model.generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1572, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2619, in sample
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 578, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 194, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 402, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 562, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 296, in forward
    using_igemmlt = supports_igemmlt(A.device) and not state.force_no_igemmlt
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 226, in supports_igemmlt
    if torch.cuda.get_device_capability(device=device) < (7, 5):
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 381, in get_device_capability
    prop = get_device_properties(device)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 395, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
Response: 
History: [{'type': 'instruction', 'value': '这是什么?', 'first_instruction': True}]

Merging the model

When merging, can the text model --text_model be a full model obtained by merging the original LLaMA with both Chinese-LLaMA and Chinese-Alpaca, or must it be the original LLaMA merged with Chinese-Alpaca only?

Model capability evaluation and data

I'd like to see how the model performs on various capabilities, such as visual reasoning and OCR. Could you also open-source the pre-training dataset, the instruction fine-tuning data, and the training code for both the pre-training and instruction fine-tuning stages?

Training data

Where does the training data come from? The quality of Chinese captioning data is relatively poor.
