
DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation [ICLR 2024]


Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou, Wenwu Zhu

Given several reference images of a subject, DisenBooth, a simple but effective finetuning method, generates customized images of that subject that are largely unaffected by subject-irrelevant information in the references. DisenBooth can also be used for subject inpainting, background customization, and similar tasks. With its flexible and controllable generation ability, DisenBooth shows great potential in many scenarios.

Applications

Subject-driven text-to-image generation

Given several images of a subject, our model can generate customized images of the subject with any text prompt.

Identity-irrelevant generation

DisenBooth can also generate images that contain only the identity-irrelevant information of each reference image, achieving promising disentanglement without additional supervision.

Customization with identity-irrelevant information

We can also combine the identity-irrelevant information with other subjects to generate new images, which amounts to text-driven image inpainting.

Customization when an image contains several subjects

When the input images contain several subjects, DisenBooth can also customize each of them. In the example above, we customize the background, the duck toy, and the cup.

For a more detailed description of these applications, please refer to our paper.

Installation

Create a conda virtual environment and install the dependencies:

pip install diffusers==0.23.1
pip install open_clip_torch
pip install torchvision
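
A quick way to confirm the environment is set up correctly (a minimal sketch; only the diffusers version is pinned by this repo):

import importlib.metadata as md

# diffusers should report 0.23.1; the other two are unpinned.
for pkg in ("diffusers", "open_clip_torch", "torchvision"):
    print(pkg, md.version(pkg))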

Usage

Finetune

To customize the Stable Diffusion model with a new subject, run the following script:

CUDA_VISIBLE_DEVICES=X bash train_disenbooth.sh

where INSTANCE_DIR is the path to the input image folder and OUT_DIR is the path where your checkpoints and validation images are saved. The --instance_prompt and --validation_prompt arguments should be adjusted to your customized subject. You should also specify a Stable Diffusion model name/path in MODEL_NAME. In our work we use Stable Diffusion 2-1 base; you can try any version of Stable Diffusion, but remember to use a matching CLIP image encoder in line 684 of train_disenbooth.py, as sketched below.
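
The encoder requirement matters because DisenBooth combines CLIP image features with the text-encoder features, so their dimensions must agree (the 768-vs-1024 RuntimeError in the Issues section below is exactly this failure). A minimal sanity check, assuming SD 2-1 base and its matching OpenCLIP ViT-H image encoder (model names here are illustrative, not the script's exact defaults):

import open_clip
from transformers import CLIPTextModel

# Text encoder of the Stable Diffusion checkpoint being finetuned.
text_encoder = CLIPTextModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", subfolder="text_encoder")

# CLIP image encoder; ViT-H-14 matches the 1024-dim text space of SD 2-1,
# while SD 1.x (768-dim) would need ViT-L-14 instead.
clip_model, _, _ = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k")

assert text_encoder.config.hidden_size == clip_model.visual.output_dim, (
    "mismatched encoders lead to the 768-vs-1024 RuntimeError during training")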

Inference

After finetuning, we provide an inference Jupyter notebook so readers can easily reproduce the results.

Note that the released version is based on diffusers 0.23.1, which is newer than version 0.13.1 used in the paper, so some results may differ slightly from those reported.
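
The notebook is the reference for inference. As a rough sketch of the flow, and assuming the finetuned weights are saved in diffusers pipeline format under OUT_DIR (the actual notebook may load the adapter weights differently; path and prompt are placeholders):

import torch
from diffusers import StableDiffusionPipeline

# OUT_DIR is the output folder from finetuning; the prompt follows the
# "a dog</w> dog" format used by the training script (see Issues below).
pipe = StableDiffusionPipeline.from_pretrained(
    "OUT_DIR", torch_dtype=torch.float16).to("cuda")
image = pipe("a dog</w> dog on the beach").images[0]
image.save("result.png")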

Citation

If you find our work useful, please kindly cite it:

@inproceedings{chen2023disenbooth,
  title={DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation},
  author={Chen, Hong and Zhang, Yipeng and Wu, Simin and Wang, Xin and Duan, Xuguang and Zhou, Yuwei and Zhu, Wenwu},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}


Issues

CLIP version for CLIP-T metric

Hi. In Table 1 of the paper, the CLIP-T score of the pretrained SD is 0.352, which seems relatively high.
Could you share the CLIP version you used to compute the CLIP embeddings for CLIP-T?
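
For context, CLIP-T is the cosine similarity between the CLIP embeddings of the text prompt and the generated image. A minimal sketch with open_clip follows; the ViT-B-32/openai backbone is only an assumed choice here (the question above is exactly which backbone the authors used):

import torch
import open_clip
from PIL import Image

# Assumed backbone; the issue above asks which one was actually used.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("generated.png")).unsqueeze(0)
text = tokenizer(["a dog on the beach"])
with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
print(torch.nn.functional.cosine_similarity(img_feat, txt_feat).item())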

Subject token

Hi!

Could you please explain which prompts you used?
Specifically,

  1. You don't use the "a photo of" prefix, do you?
  2. Do you use prompts like "a $class</w> $class" for every subject?

In the paper you wrote:
[screenshot of the prompt format from the paper]

In the script you use "a dog</w> dog", which is actually tokenized as:

['a</w>', 'dog</w>', '</</w>', 'w</w>', '></w>', 'dog</w>']

Thus, "dog</w>" is a common token if I understand correctly.

Training steps per subject

Hi, I'm trying to reproduce the paper results, and it seems different numbers of training steps were used for different subjects.
Could you share the training steps for each subject from DreamBench?

Operational problem

Hello, I followed the README steps but the script fails to run; the only change I made was loading the models from a local path because I couldn't connect to Hugging Face.
Traceback (most recent call last):
  File "/data/disk1/sxtang/Project/DisenBooth/train_disenbooth.py", line 1147, in <module>
    main(args)
  File "/data/disk1/sxtang/Project/DisenBooth/train_disenbooth.py", line 1026, in main
    img_state = img_adapter(img_state)
  File "/data/disk1/sxtang/anaconda3/envs/Dreambooth2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/disk1/sxtang/Project/DisenBooth/disen_net.py", line 16, in forward
    out_feature = self.adapter( self.sigmoid(self.mask)*feature ) + self.sigmoid(self.mask)*feature
RuntimeError: The size of tensor a (768) must match the size of tensor b (1024) at non-singleton dimension 2
