
ProFusion

ProFusion (with an encoder pre-trained on a large dataset such as CC3M) can be used to efficiently construct a customization dataset, which in turn can be used to train a tuning-free customization assistant (CAFE).

Given a test image, the assistant can perform customized generation in a tuning-free manner. It can take complex user input and generate text explanations and elaborations along with images, without any fine-tuning.


[Figure: Results from CAFE]






Code for Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach.


[Figure: Results from ProFusion]


ProFusion is a framework for customizing pre-trained large-scale text-to-image generation models; our examples use Stable Diffusion 2.

[Figure: Illustration of the proposed ProFusion framework]


With ProFusion, you can generate an unlimited number of creative images for a novel/unique concept from a single test image, on a single GPU (about 20 GB of VRAM is needed when fine-tuning with batch size 1).


[Figure: Results from ProFusion]


Example

  • Install dependencies (we revised the original diffusers);

      cd ./diffusers
      pip install -e .
      cd ..
      pip install accelerate==0.16.0 torchvision transformers==4.25.1 datasets ftfy tensorboard Jinja2 regex tqdm joblib 
    
  • Initialize Accelerate;

      accelerate config
    
  • Download a model pre-trained on FFHQ;

  • Customize the model with a test image; an example is shown in the notebook test.ipynb, and a minimal sketch follows below.
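
A minimal sketch of what loading the customized pipeline might look like, for orientation only; the real arguments live in test.ipynb. StableDiffusionPromptNetPipeline is provided by the revised diffusers bundled in this repo, but the checkpoint path and the from_pretrained call pattern below are assumptions borrowed from standard diffusers usage:

      import torch
      from diffusers import StableDiffusionPromptNetPipeline  # added by the revised diffusers in ./diffusers

      # Hypothetical path -- point this at the FFHQ-pretrained model you downloaded.
      pipe = StableDiffusionPromptNetPipeline.from_pretrained(
          "./promptnet_ffhq", torch_dtype=torch.float16
      ).to("cuda")

      # See test.ipynb for the actual customization (fine-tuning) and generation calls.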

Train Your Own Encoder

If you want to train a PromptNet encoder for other domains or on your own dataset:

  • First, prepare an image-only dataset (a preprocessing sketch is given after the training command below);

    • In our experiments on the human face domain, we used FFHQ. Our pre-processed FFHQ can be found at google drive link;
    • We also trained an encoder on CC3M, which leads to good customization results on arbitrary downstream domains;
  • Then, run

      accelerate launch --mixed_precision="fp16" train.py \
            --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-base" \
            --train_data_dir=./images_512 \
            --max_train_steps=80000 \
            --learning_rate=2e-05 \
            --output_dir="./promptnet" \
            --train_batch_size=8 \
            --promptnet_l2_reg=0.000 \
            --gradient_checkpointing
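
The command above expects 512×512 images in ./images_512. A minimal preprocessing sketch, assuming your raw images live in a hypothetical ./raw_images directory:

      import os
      from PIL import Image

      os.makedirs("./images_512", exist_ok=True)
      for name in os.listdir("./raw_images"):  # hypothetical source directory
          if not name.lower().endswith((".jpg", ".jpeg", ".png")):
              continue
          img = Image.open(os.path.join("./raw_images", name)).convert("RGB")
          img = img.resize((512, 512), Image.LANCZOS)  # center-crop first if aspect ratios vary
          img.save(os.path.join("./images_512", os.path.splitext(name)[0] + ".png"))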
    

Citation

@article{zhou2023enhancing,
  title={Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach},
  author={Zhou, Yufan and Zhang, Ruiyi and Sun, Tong and Xu, Jinhui},
  journal={arXiv preprint arXiv:2305.13579},
  year={2023}
}

profusion's People

Contributors

drboog, zhangry868


profusion's Issues

Fine-tuned model fuses faces

When I test multiple images, I first do data augmentation with these two images at random, and then fine-tune my model on these data. But when I try different cfg combinations of the two pictures, I get pictures with almost the same kind of face (a combination of the faces from both pictures); the cfg seems to change only the clothing and background.
I noticed that you provided an identity_small model in your paper, which seems to change the face (more like A or more like B) according to different cfg (screenshot attached).
I notice that this model is fine-tuned on some celebrity images. Can you provide this data? Or simply tell me how to generate my fine-tuning dataset so that the model does not fuse faces.

Text prompt used during training

Hi,

Thanks for the great work!
Could you say which prompt was used during training? Did you use the default prompt "A photo of", or did you vary the text prompt randomly for each batch?

Thanks!

Error with diffusers

Hi, how can I fix it?
ImportError: cannot import name 'StableDiffusionPromptNetPipeline' from 'diffusers'
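
A likely fix, based on the install step in this README (an assumption, since StableDiffusionPromptNetPipeline only exists in the revised diffusers bundled with this repo, not in the stock package):

      pip uninstall -y diffusers  # remove the stock package so it does not shadow the revised copy
      cd ./diffusers
      pip install -e .
      cd ..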

License

Hi, this looks really interesting.
The license is prohibitive for a lot of folks.
Would you consider making it MIT or Apache 2.0?

Get high resolution photos

Great job! If I want to get higher-resolution photos, like 1024×1024, do I need to retrain the pretrained model?

Training on SD 1.5

Hello, I would like to know whether I can train this model on stable-diffusion-v1-5, which is compatible with ControlNet. I have tried to change the base model, but I encounter a dimension mismatch error, and it cannot be addressed simply by changing the CLIP model. Please offer me some help, thank you!
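
For context (not from the repo): the mismatch is plausibly the cross-attention width the UNet expects, since SD 1.x and SD 2.x use different text encoders. You can inspect it with stock diffusers; the model IDs here are the usual Hugging Face ones:

      from diffusers import UNet2DConditionModel

      unet = UNet2DConditionModel.from_pretrained(
          "runwayml/stable-diffusion-v1-5", subfolder="unet"
      )
      print(unet.config.cross_attention_dim)  # 768 for SD 1.5; SD 2 expects 1024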

Which cosine similarity did you use?

Hi! @drboog Thank you for releasing the code. I read your paper with great interest!
While following the flow of the paper, I am not sure which evaluation metric to use.
Is image-prompt similarity equal to torchmetrics.multimodal.clip_score.CLIPScore * 0.01?

torchmetrics.multimodal.clip_score.CLIPScore is defined as:

$$\text{CLIPScore}(I, C) = \max(100 \cdot \cos(E_I, E_C), 0)$$
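
For reference, a minimal sketch of computing that metric with torchmetrics (the CLIP checkpoint here is a common default; whether the paper used this exact checkpoint is an assumption):

      import torch
      from torchmetrics.multimodal.clip_score import CLIPScore

      metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
      images = torch.randint(0, 255, (2, 3, 224, 224), dtype=torch.uint8)  # placeholder generated images
      prompts = ["a photo of a man", "a photo of a woman"]                 # placeholder prompts
      score = metric(images, prompts)   # in [0, 100]
      print(score.item() * 0.01)        # scaled to [0, 1] as image-prompt similarity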

Use with controlnet

I have run your code and it's amazing!
Can we connect it with ControlNet? I noticed that you override a StableDiffusionPromptNetPipeline; is it possible to create something like a StableDiffusionPromptNetControlnetPipeline?

Role of ref_emb_scale

I am wondering if ref_emb_scale, which is ultimately provided as scale in this function, is gamma in Algorithm 1 of your paper.

Can you provide details regarding this functionality?
The scale doesn't seem to have any effect on the reference prompt, but rather has something to do with the original prompt.
The comments suggest that a higher refine_emb_scale leads to using more information from the input image.
Why does scaling down the second-to-last hidden state of the CLIP encoder output (the default value is 0.8) lead to using more of the input (reference) image?
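
Illustrative only — an assumption about the behavior the question describes, not the repo's actual code (the CLIP checkpoint is also just an example):

      from transformers import CLIPTextModel, CLIPTokenizer

      tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
      enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
      ids = tok(["a photo of a man"], return_tensors="pt").input_ids
      hidden = enc(ids, output_hidden_states=True).hidden_states[-2]  # second-to-last hidden state
      hidden = hidden * 0.8  # ref_emb_scale, default 0.8 per the discussion above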

Colab(?)

Can you provide a Google Colab, please?

Implementation details

First, thanks for a very interesting paper. Looking through your code, I see that you use _build_causal_attention_mask and pass this attention mask to the text encoder during training, which indeed seems to make sense. But all official examples in 🤗 diffusers don't provide attention masks to the text encoder during training (in the TI and DB training scripts); why is that? I also see your comment: # the causal mask is important, don't forget it if you try to customize the code. Do you think passing this mask could improve the results for vanilla TI training as well? As a side note, in my custom TI training, passing attention_mask for padded tokens significantly improves the convergence speed and final results, but now I think the causal mask is needed in addition to that.

Also, in the paper you mention segmenting people/faces and then inpainting them, but I can't see this in train.py. Have you tried just applying this mask to the predictions, instead of using inpainting? It seems like a faster and potentially better approach. See this for details: cloneofsimo/lora#96
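
For context, a minimal sketch of the causal attention mask being discussed, in pure PyTorch. The additive large-negative convention and [bsz, 1, seq, seq] shape mirror the CLIP text encoder in transformers, but treat the exact wiring into this repo's training loop as an assumption:

      import torch

      def build_causal_attention_mask(bsz: int, seq_len: int, dtype: torch.dtype) -> torch.Tensor:
          # Additive mask: 0 where attention is allowed, a large negative value
          # strictly above the diagonal, so token i attends only to tokens <= i.
          mask = torch.full((seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype)
          mask.triu_(1)  # keep the penalty strictly above the diagonal, zero elsewhere
          return mask[None, None].expand(bsz, 1, seq_len, seq_len)

      causal = build_causal_attention_mask(bsz=2, seq_len=77, dtype=torch.float16)
      print(causal.shape)  # torch.Size([2, 1, 77, 77])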

Training issue

Hi @drboog, thank you for releasing the code of ProFusion.
I'm a graduate student currently studying the ProFusion paper. I would like to reproduce your work by training the encoder on the FFHQ dataset. However, when I use the script you've provided, it gets stuck at a warning message (screenshot attached). Please tell me how to fix it.

multiple subject

I want to create two people simultaneously; is generating multiple subjects possible?

can't reproduce

Thanks for publishing this interesting code.

I immediately tried fine-tuning from a single test image, but could not reproduce the results.

The image used is test_imgs/danielwu.jpg, with a batch size of 1.
I use this as the execution script.

If there are any other necessary settings, please let me know.

what does '*_structure' mean?

Hi @zhangry868 @drboog, thanks for releasing the code! I read the code and found that in the log_validation function, you generate some images whose filenames end with 'structure.jpg'. Could you please explain what this means? Thanks a lot!

Out of memory

I am using an A10 with 40GB VRAM but am unable to run the fine-tuning part. I am getting this error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB (GPU 0; 39.56 GiB total capacity; 36.60 GiB already allocated; 288.56 MiB free; 37.58 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

P.S. A batch size of 1 worked.
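
As the error message itself suggests, fragmentation can sometimes be worked around by capping the allocator's split size before launching training (a standard PyTorch environment variable; the 128 MB value is just an example):

      export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
      accelerate launch --mixed_precision="fp16" train.py ...  # rerun the same training command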
