Kosmos-G: Generating Images in Context with Multimodal Large Language Models

Paper | Project Page

Checkpoints

Download ViT-L-14-sd.pt along with the checkpoints for stage 1, stage 2, and the final model.

mkdir kosmosg_checkpoints
cd kosmosg_checkpoints
DLINK=$(echo -n "aHR0cHM6Ly9jb252ZXJzYXRpb25odWIuYmxvYi5jb3JlLndpbmRvd3MubmV0L2JlaXQtc2hhcmUtcHVibGljL2tvc21vc2cvVmlULUwtMTQtc2QucHQ/c3Y9MjAyMy0wMS0wMyZzdD0yMDI0LTA0LTEwVDEzJTNBMTElM0E0NFomc2U9MjA1MC0wNC0xMVQxMyUzQTExJTNBMDBaJnNyPWMmc3A9ciZzaWc9NGNYSklqVlJaSElCV3FIalBnRG4lMkYwMW9jenBEV1hpcG1QQ1VrM1o4dmJRJTNE" | base64 --decode)
wget -O ViT-L-14-sd.pt "$DLINK"
DLINK=$(echo -n "aHR0cHM6Ly9jb252ZXJzYXRpb25odWIuYmxvYi5jb3JlLndpbmRvd3MubmV0L2JlaXQtc2hhcmUtcHVibGljL2tvc21vc2cvY2hlY2twb2ludF9zdGFnZTEucHQ/c3Y9MjAyMy0wMS0wMyZzdD0yMDI0LTA0LTEwVDEzJTNBMTElM0E0NFomc2U9MjA1MC0wNC0xMVQxMyUzQTExJTNBMDBaJnNyPWMmc3A9ciZzaWc9NGNYSklqVlJaSElCV3FIalBnRG4lMkYwMW9jenBEV1hpcG1QQ1VrM1o4dmJRJTNE" | base64 --decode)
wget -O checkpoint_stage1.pt "$DLINK"
DLINK=$(echo -n "aHR0cHM6Ly9jb252ZXJzYXRpb25odWIuYmxvYi5jb3JlLndpbmRvd3MubmV0L2JlaXQtc2hhcmUtcHVibGljL2tvc21vc2cvY2hlY2twb2ludF9zdGFnZTIucHQ/c3Y9MjAyMy0wMS0wMyZzdD0yMDI0LTA0LTEwVDEzJTNBMTElM0E0NFomc2U9MjA1MC0wNC0xMVQxMyUzQTExJTNBMDBaJnNyPWMmc3A9ciZzaWc9NGNYSklqVlJaSElCV3FIalBnRG4lMkYwMW9jenBEV1hpcG1QQ1VrM1o4dmJRJTNE" | base64 --decode)
wget -O checkpoint_stage2.pt "$DLINK"
DLINK=$(echo -n "aHR0cHM6Ly9jb252ZXJzYXRpb25odWIuYmxvYi5jb3JlLndpbmRvd3MubmV0L2JlaXQtc2hhcmUtcHVibGljL2tvc21vc2cvY2hlY2twb2ludF9maW5hbC5wdD9zdj0yMDIzLTAxLTAzJnN0PTIwMjQtMDQtMTBUMTMlM0ExMSUzQTQ0WiZzZT0yMDUwLTA0LTExVDEzJTNBMTElM0EwMFomc3I9YyZzcD1yJnNpZz00Y1hKSWpWUlpISUJXcUhqUGdEbiUyRjAxb2N6cERXWGlwbVBDVWszWjh2YlElM0Q=" | base64 --decode)
wget -O checkpoint_final.pt "$DLINK"
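
A download can fail partway or leave an empty file behind, so it may help to confirm that all four files arrived before moving on. The following is a minimal sketch and not part of the repository: the function name check_checkpoints is hypothetical, and only the four file names are taken from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that the four downloaded files exist and are non-empty.
check_checkpoints() {
    local dir="${1:-kosmosg_checkpoints}"
    local missing=0
    local name
    for name in ViT-L-14-sd.pt checkpoint_stage1.pt checkpoint_stage2.pt checkpoint_final.pt; do
        # -s is true only when the file exists and has a size greater than zero.
        if [ ! -s "$dir/$name" ]; then
            echo "missing or empty: $dir/$name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_checkpoints kosmosg_checkpoints && echo "all checkpoints present"` prints the message only when every file is in place.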

Setup

Using Docker Image [Recommended]

You can use our prebuilt Docker image:

docker run --runtime=nvidia --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --name kosmosg --privileged=true -it -v /mnt:/mnt/ xichenpan/kosmosg:v1 /bin/bash
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-g
pip install torchscale/
pip install open_clip/
pip install fairseq/
pip install infinibatch/

You can also start from the official NVIDIA Docker image and install all dependencies manually:

docker run --runtime=nvidia --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --name kosmosg --privileged=true -it -v /mnt:/mnt/ nvcr.io/nvidia/pytorch:22.10-py3 /bin/bash
apt-get install -y libsm6 libxext6 libxrender-dev
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-g
bash vl_setup.sh

Using Base Environment

Make sure you have PyTorch 1.13.0 and nvcc 11.x installed.

git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-g
bash vl_setup.sh
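
Before running the setup script in a base environment, the two version requirements above can be checked roughly as follows. This is a sketch under assumptions: the helper names torch_is_1_13_0 and nvcc_is_11x are hypothetical, and the checks rely on the usual shape of `torch.__version__` (for example "1.13.0+cu117") and of the "release 11.x" line printed by `nvcc --version`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: both take the tool's output as a string, so they can be
# exercised without the tool installed.

# True when a torch.__version__ string such as "1.13.0+cu117" starts with 1.13.0.
torch_is_1_13_0() {
    case "$1" in
        1.13.0*) return 0 ;;
        *) return 1 ;;
    esac
}

# True when the output of `nvcc --version` reports a CUDA 11.x release.
nvcc_is_11x() {
    printf '%s\n' "$1" | grep -Eq 'release 11\.[0-9]+'
}
```

A possible use is `torch_is_1_13_0 "$(python -c 'import torch; print(torch.__version__)')" || echo "unexpected PyTorch version"`, and likewise `nvcc_is_11x "$(nvcc --version)" || echo "unexpected nvcc version"`.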

Demo

If you would like to host a local Gradio demo, run the following command after setup:

bash runapp.sh

If the default guidance scale produces over-saturated images, adjust it.

Training

Preparing dataset

Refer to this guide to prepare the dataset.

Train script

After preparing the data, run the following commands to train the model. Be sure to change the directories in the scripts to your own.

For the image decoder aligning stage:

bash runalign.sh

For the instruction tuning stage:

bash runtrain.sh

Evaluation

FID score on COCO (2014) val set

Download and unzip the COCO (2014) val set:

mkdir coco
cd coco
wget http://images.cocodataset.org/zips/val2014.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
unzip val2014.zip
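
To confirm the archive unpacked completely, the extracted images can be counted. The helper below is a sketch; the name count_jpgs is hypothetical, and it assumes the images land directly inside a val2014 directory as .jpg files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the .jpg files directly inside a directory.
count_jpgs() {
    # tr strips the padding that some wc implementations add around the number.
    find "$1" -maxdepth 1 -type f -name '*.jpg' | wc -l | tr -d ' '
}
```

Running `count_jpgs val2014` from inside the coco directory should report 40504, the commonly documented size of the COCO 2014 val split; a smaller number suggests an interrupted download or unzip.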

Specify the cfg in sample_kosmosg_coco.py and run the script to evaluate:

bash runeval_coco.sh

DINO score, CLIP-I score and CLIP-T score on DreamBench

Download DreamBench:

mkdir dreambench
cd dreambench
git clone https://github.com/google/dreambooth.git

We keep only one image for each entity as described in our paper.

bash scripts/remove_dreambench_multiimg.sh /path/to/dreambench/dreambooth/dataset
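
The script in the repository is the authoritative way to do this. Purely as an illustration of what "one image per entity" means, the dry-run sketch below lists the files that would be surplus. It assumes one subdirectory per entity and that the first file in sorted order is the one kept; both are assumptions and may not match what remove_dreambench_multiimg.sh does. The function name list_extra_images is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: for each entity subdirectory, print every file
# except the first one in sorted order. Nothing is deleted.
list_extra_images() {
    local root="$1"
    local entity
    for entity in "$root"/*/; do
        # Skip the unexpanded pattern when there are no subdirectories.
        [ -d "$entity" ] || continue
        # sort gives a stable order; tail -n +2 skips the first (kept) file.
        find "$entity" -maxdepth 1 -type f | sort | tail -n +2
    done
}
```

For example, `list_extra_images /path/to/dreambench/dreambooth/dataset | wc -l` shows how many images fall outside the one-per-entity subset.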

Specify the cfg in sample_kosmosg_dreambench.py and run the script to evaluate:

bash runeval_dreambench.sh

Citation

If you find this repository useful, please consider citing our work:

@article{kosmos-g,
  title={{Kosmos-G}: Generating Images in Context with Multimodal Large Language Models},
  author={Xichen Pan and Li Dong and Shaohan Huang and Zhiliang Peng and Wenhu Chen and Furu Wei},
  journal={ArXiv},
  year={2023},
  volume={abs/2310.02992}
}

Disclaimer

Kosmos-G is purely a research project. Currently, we have no plans to incorporate Kosmos-G into a product or expand access to the public. We will also put Microsoft AI principles into practice when further developing the models.

In our research paper, we account for the ethical concerns associated with text-to-image research. To mitigate issues associated with training data, we have implemented a rigorous filtering process to purge our training data of inappropriate content, such as explicit imagery and offensive language, to minimize the likelihood of generating inappropriate content.

Acknowledgement

This repository is built using torchscale, fairseq, and openclip. We thank the authors of Nerfies, who kindly open-sourced the template of the project page.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using models, please submit a GitHub issue.

kosmos-g's Issues

About the alignernet training data form

Thank you for your great work!
I would like to know about the data format for the align training stage. Specifically, what data format does laion2b_loader in the code load, and how do I build it?
