
Anole: An Open, Autoregressive and Native Multimodal Models for Interleaved Image-Text Generation

GAIR-Anole

📊 Example | 🤗 Hugging Face | 📤 Get Started | 🌐 Website | 📄 Preprint

This is the GAIR Anole project, which aims to build and opensource large multimodal models with comprehensive multimodal understanding and generation capabilities.

👋 Overview

Anole is the first open-source, autoregressive, natively trained large multimodal model capable of interleaved image-text generation (without using Stable Diffusion). It builds on Chameleon and targets the harder task of generating coherent sequences of alternating text and images. Fine-tuning on a carefully curated dataset of approximately 6,000 images gives Anole image generation and understanding capabilities with minimal additional training. This efficiency, combined with its open-source release, is intended to accelerate research and development in multimodal AI. In preliminary tests, Anole follows nuanced instructions and produces high-quality images and interleaved text-image content that closely align with user prompts.

The major functionalities of Anole are listed below:

  • Text-to-Image Generation (new)
  • Interleaved Text-Image Generation (new)
  • Text Generation
  • MultiModal Understanding

Items marked "(new)" are capabilities that Anole adds on top of Chameleon.

📊 Examples

To better illustrate Anole's capabilities, here are some examples of its performance.

Note

We have provided open-source model weights, code, and detailed tutorials below to ensure that each of you can reproduce these results, and even fine-tune the model to create your own stylistic variations. (Democratization of technology is always our goal.)

Interleaved Image-Text Generation

[Figure: interleaved image-text generation examples]

Text2Image

[Figure: text-to-image generation examples]

🔍 Methodology

Based on available information and our own testing, the latest release of Chameleon has demonstrated strong performance in text understanding, text generation, and multimodal understanding. Anole is built on top of Chameleon and aims to unlock Chameleon's image generation and multimodal generation capabilities.

Anole-Main

Chameleon's pre-training data natively includes both text and image modalities, which in theory equips it with image generation capabilities. Our goal is to unlock this ability without compromising its text understanding, text generation, and multimodal comprehension. To achieve this, we froze most of Chameleon's parameters and fine-tuned only the logits corresponding to image token ids in the transformer's output head layer.
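The selective update described above can be illustrated with a toy sketch. This is not the repository's training code: the function name `masked_sgd_step`, the plain SGD rule, and the tiny 4-token vocabulary are all assumptions chosen for illustration. It only shows the idea of updating the output-head rows that belong to image token ids while leaving every other row frozen.

```python
from typing import List, Set


def masked_sgd_step(
    head: List[List[float]],
    grad: List[List[float]],
    image_token_ids: Set[int],
    lr: float,
) -> List[List[float]]:
    """Return a new output head in which only image-token rows are updated.

    Row i of `head` produces the logit for token id i. Rows whose id is not
    in `image_token_ids` are copied unchanged, i.e. they stay frozen.
    """
    updated = []
    for token_id, (row, grad_row) in enumerate(zip(head, grad)):
        if token_id in image_token_ids:
            # Trainable row: apply a plain SGD step (hypothetical optimizer).
            updated.append([w - lr * g for w, g in zip(row, grad_row)])
        else:
            # Frozen row: ignore its gradient entirely.
            updated.append(list(row))
    return updated


# Toy setup: 4 tokens, hidden size 2; ids 2 and 3 play the role of image tokens.
head = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
grad = [[0.5, 0.5] for _ in head]
new_head = masked_sgd_step(head, grad, image_token_ids={2, 3}, lr=1.0)
print(new_head)
```

In a real framework the same effect is usually obtained by masking gradients on the output head rather than copying rows, but the frozen-versus-trainable split is the same.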

Specifically, Anole-7b-v0.1 was developed using a small amount of image data (5,859 images, approximately 6 million image tokens) and was fine-tuned on only a small number of parameters (fewer than 40M) in a short time (around 30 minutes on 8 A100 GPUs). Despite this, Anole-7b-v0.1 exhibits impressive image generation capabilities.
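As a rough sanity check on these figures, the arithmetic below reproduces them under assumptions that are not stated in this README: 1,024 tokens per image, an image codebook of 8,192 entries, and a hidden size of 4,096 for the 7B model. Only the image count (5,859) comes from the text above; treat the other constants as assumptions to verify against the Chameleon release.

```python
NUM_IMAGES = 5859          # from the text above
TOKENS_PER_IMAGE = 1024    # assumption: tokens produced per image
IMAGE_VOCAB_SIZE = 8192    # assumption: number of image token ids
HIDDEN_SIZE = 4096         # assumption: hidden size of the 7B model

# Total image tokens seen during fine-tuning.
total_image_tokens = NUM_IMAGES * TOKENS_PER_IMAGE

# One output-head row of length HIDDEN_SIZE per image token id.
trainable_params = IMAGE_VOCAB_SIZE * HIDDEN_SIZE

print(f"image tokens:     {total_image_tokens:,}")   # 5,999,616
print(f"trainable params: {trainable_params:,}")     # 33,554,432
```

Under these assumptions the token count lands at roughly 6 million and the trainable parameter count stays below 40M, consistent with the numbers quoted above.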

We are committed to continuously updating Anole to enhance its capabilities.

🚀 Get started

Installation

  1. Download the model: Anole or Chameleon
git lfs install
git clone https://huggingface.co/GAIR/Anole-7b-v0.1

or

huggingface-cli download --resume-download GAIR/Anole-7b-v0.1 --local-dir Anole-7b-v0.1 --local-dir-use-symlinks False
  2. Install transformers from the chameleon branch (already included in this repo), the chameleon library, and the other requirements
git clone https://github.com/GAIR-NLP/anole.git
cd anole
bash install.sh

Inference on Anole

Our inference code is based on Meta Chameleon, which has been optimized and accelerated for inference. It also includes a visual viewer for debugging.

Checkpoint

To set your checkpoint path, modify constants.py. By default, the model loads a checkpoint from ./data.

A more flexible approach is to configure the checkpoint path via the .env file by setting CKPT_PATH, or to export it directly in your shell:

export CKPT_PATH=/path/to/your/Anole/ckpt
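The lookup order described here (an explicit CKPT_PATH first, the ./data default otherwise) could be expressed as in the sketch below. This is an illustration only and not the contents of constants.py; the helper name `resolve_ckpt_path` and its `env` parameter are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default location mentioned above; used when CKPT_PATH is not set.
DEFAULT_CKPT_PATH = Path("./data")


def resolve_ckpt_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the checkpoint directory, preferring the CKPT_PATH variable.

    `env` defaults to the process environment; passing a dict makes the
    function easy to exercise without touching real environment variables.
    """
    env = os.environ if env is None else env
    value = env.get("CKPT_PATH")
    return Path(value) if value else DEFAULT_CKPT_PATH


print(resolve_ckpt_path({}))
print(resolve_ckpt_path({"CKPT_PATH": "/path/to/your/Anole/ckpt"}))
```

Values from a .env file would need to be loaded into the environment before a lookup like this runs; how the repository does that is not shown here.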

Text to Image

To generate images based on text, run the text2image.py script:

python text2image.py [-h] -i INSTRUCTION [-b BATCH_SIZE] [-s SAVE_DIR]
  • instruction: The instruction for image generation.
  • batch_size: The number of images to generate.
  • save_dir: The directory to save the generated images.

This command will generate batch_size images based on the same instruction at once, with a default of 10 images. For instance:

python text2image.py -i 'draw a dog'
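For readers who want to see how the usage line maps onto options, here is a minimal argparse sketch that accepts the same short flags. It is not the script's actual source: the long option names and the `./outputs` default for the save directory are assumptions, while the required `-i` flag and the default batch size of 10 follow the description above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented text2image.py usage line."""
    parser = argparse.ArgumentParser(prog="text2image.py")
    parser.add_argument("-i", "--instruction", required=True,
                        help="The instruction for image generation.")
    parser.add_argument("-b", "--batch_size", type=int, default=10,
                        help="The number of images to generate.")
    # The default directory below is hypothetical; the README does not state it.
    parser.add_argument("-s", "--save_dir", default="./outputs",
                        help="The directory to save the generated images.")
    return parser


# Equivalent to: python text2image.py -i 'draw a dog'
args = build_parser().parse_args(["-i", "draw a dog"])
print(args.instruction, args.batch_size, args.save_dir)
```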

Interleaved Image-Text Generation

To generate interleaved image-text content, run the interleaved_generation.py script:

python interleaved_generation.py [-h] -i INSTRUCTION [-s SAVE_DIR]
  • instruction: The instruction for interleaved image-text generation.
  • save_dir: The directory to save the generated images.

For instance:

python interleaved_generation.py -i 'Please introduce the city of Gyumri with pictures.'

Multimodal-in and multimodal-out

We divide multimodal input into segments by modality; the type of each segment is "text" or "image". (See input.json for details.) You can control the multimodal input by constructing such input files. To run inference on such a file, use the inference.py script:

python inference.py [-h] -i INPUT [-s SAVE_DIR]
  • input: The multimodal input file.
  • save_dir: The directory to save the generated images.

For instance:

python inference.py -i input.json
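A segmented input of this kind might be assembled as in the sketch below. The exact schema is defined by input.json in the repository and is not reproduced here: the list-of-objects layout, the `content` key, and the file name `example.png` are assumptions for illustration. Only the two segment types, "text" and "image", come from the description above.

```python
import json
from typing import Dict, Iterable, List, Tuple

ALLOWED_TYPES = {"text", "image"}  # the two segment types named above


def build_segments(parts: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Turn (type, content) pairs into a list of segment dictionaries.

    The "content" key is a hypothetical field name; check input.json for
    the real one before using this with inference.py.
    """
    segments = []
    for seg_type, content in parts:
        if seg_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported segment type: {seg_type!r}")
        segments.append({"type": seg_type, "content": content})
    return segments


segments = build_segments([
    ("text", "What is shown in this picture?"),
    ("image", "example.png"),  # hypothetical image path
])
payload = json.dumps(segments, indent=2)
print(payload)
```

Writing `payload` to a file and passing that file with `-i` would follow the command shown above, provided the field names match the repository's input.json.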

Fine-tune Anole & Chameleon

Please follow the instructions in training and facilitating_image_generation. Note that we will continuously update this part.

Our fine-tuning code is built on the transformers Trainer and DeepSpeed, and is largely inspired by pull request #31534 in transformers.

🛠️ Models

Model Name | HF Checkpoints | License
Anole-7b-v0.1 | 🤗 7B | Chameleon License

โญ๏ธ Next steps

  • Support multimodal inference using Hugging Face
  • Support conversion between Hugging Face model and PyTorch model

📝 Usage and License Notices

Anole is intended for research use only. Our model weights follow the same license as Chameleon. The fine-tuning images we used are from LAION-5B aesthetic, and thus follow the same license as LAION.

⚠️ Disclaimer

Anole is still under development and has many limitations that need to be addressed. Importantly, we have not aligned the image generation capabilities of the Anole model to ensure safety and harmlessness. Therefore, we encourage users to interact with Anole with caution and report any concerning behaviors to help improve the model's safety and ethical considerations.

🙏 Acknowledgements

  • We sincerely thank the Meta Chameleon Team for open-sourcing Chameleon, as most of our inference code is based on it.
  • We also greatly appreciate @zucchini-nlp and all the contributors to pull request #31534 submitted to transformers. This PR was crucial for the development of our training code.

Citation

Please cite our paper if you find the repository helpful.

@article{chern2024anole,
  title={ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation},
  author={Chern, Ethan and Su, Jiadi and Ma, Yan and Liu, Pengfei},
  journal={arXiv preprint arXiv:2407.06135},
  year={2024}
} 

Contributors

mantle2048, ethanc111, joyboy-su, koalazf99
