
Stable Diffusion Version 2


This repository contains Stable Diffusion models trained from scratch and will be continuously updated with new checkpoints. The following list provides an overview of all currently available models. More coming soon.

News

March 24, 2023

Stable UnCLIP 2.1

December 7, 2022

Version 2.1

  • New stable diffusion model (Stable Diffusion 2.1-v, Hugging Face) at 768x768 resolution and (Stable Diffusion 2.1-base, Hugging Face) at 512x512 resolution. Both use the same number of parameters and architecture as 2.0 and are fine-tuned from 2.0 on a less restrictive NSFW filtering of the LAION-5B dataset. By default, the attention operation of the model is evaluated at full precision when xformers is not installed. To enable fp16 (which can cause numerical instabilities with the vanilla attention module on the v2.1 model), run your script with ATTN_PRECISION=fp16 python <thescript.py>. A minimal sketch of this precision toggle is shown below.
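
For intuition, here is a minimal sketch of what such a toggle can look like in a PyTorch attention module; the function name and structure are illustrative, not the repository's exact code:

import os
import torch

# ATTN_PRECISION=fp16 opts into half-precision attention; anything else
# (including unset) up-casts q/k to fp32 for the similarity computation.
ATTN_PRECISION = os.environ.get("ATTN_PRECISION", "fp32")

def attention_probs(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    scale = q.shape[-1] ** -0.5
    if ATTN_PRECISION == "fp32":
        q, k = q.float(), k.float()
    sim = torch.einsum("b i d, b j d -> b i j", q, k) * scale
    return sim.softmax(dim=-1)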

November 24, 2022

Version 2.0

  • New stable diffusion model (Stable Diffusion 2.0-v) at 768x768 resolution. Same number of parameters in the U-Net as 1.5, but uses OpenCLIP-ViT/H as the text encoder and is trained from scratch. SD 2.0-v is a so-called v-prediction model.

  • The above model is finetuned from SD 2.0-base, which was trained as a standard noise-prediction model on 512x512 images and is also made available.

  • Added a x4 upscaling latent text-guided diffusion model.

  • New depth-guided stable diffusion model, finetuned from SD 2.0-base. The model is conditioned on monocular depth estimates inferred via MiDaS and can be used for structure-preserving img2img and shape-conditional synthesis.


  • A text-guided inpainting model, finetuned from SD 2.0-base.

We follow the original repository and provide basic inference scripts to sample from the models.


The original Stable Diffusion model was created in a collaboration with CompVis and RunwayML and builds upon the work:

High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach*, Andreas Blattmann*, Dominik Lorenz, Patrick Esser, Björn Ommer
CVPR '22 Oral | GitHub | arXiv | Project page

and many others.

Stable Diffusion is a latent text-to-image diffusion model.


Requirements

You can update an existing latent diffusion environment by running

conda install pytorch==1.12.1 torchvision==0.13.1 -c pytorch
pip install transformers==4.19.2 diffusers invisible-watermark
pip install -e .

xformers efficient attention

For more efficiency and speed on GPUs, we highly recommend installing the xformers library.

Tested on A100 with CUDA 11.4. Installation needs a somewhat recent version of nvcc and gcc/g++, obtain those, e.g., via

export CUDA_HOME=/usr/local/cuda-11.4
conda install -c nvidia/label/cuda-11.4.0 cuda-nvcc
conda install -c conda-forge gcc
conda install -c conda-forge gxx_linux-64==9.5.0

Then, run the following (compiling takes up to 30 min).

cd ..
git clone https://github.com/facebookresearch/xformers.git
cd xformers
git submodule update --init --recursive
pip install -r requirements.txt
pip install -e .
cd ../stablediffusion

Upon successful installation, the code will automatically default to memory efficient attention for the self- and cross-attention layers in the U-Net and autoencoder.
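
Roughly, this runtime check looks like the following sketch (illustrative only; the actual logic lives in the repository's attention modules):

import torch

try:
    import xformers
    import xformers.ops
    XFORMERS_IS_AVAILABLE = True
except ImportError:
    XFORMERS_IS_AVAILABLE = False

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Use xformers' memory-efficient kernel when available, otherwise fall back
    # to the standard (quadratic-memory) formulation.
    if XFORMERS_IS_AVAILABLE:
        return xformers.ops.memory_efficient_attention(q, k, v)
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1) * scale).softmax(dim=-1)
    return attn @ v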

General Disclaimer

Stable Diffusion models are general text-to-image diffusion models and therefore mirror biases and (mis-)conceptions that are present in their training data. Although efforts were made to reduce the inclusion of explicit pornographic material, we do not recommend using the provided weights for services or products without additional safety mechanisms and considerations. The weights are research artifacts and should be treated as such. Details on the training procedure and data, as well as the intended use of the model can be found in the corresponding model card. The weights are available via the StabilityAI organization at Hugging Face under the CreativeML Open RAIL++-M License.

Stable Diffusion v2

Stable Diffusion v2 refers to a specific configuration of the model architecture that uses a downsampling-factor 8 autoencoder with an 865M UNet and OpenCLIP ViT-H/14 text encoder for the diffusion model. The SD 2-v model produces 768x768 px outputs.

Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 DDIM sampling steps show the relative improvements of the checkpoints:

[Figure: relative improvements of the SD 2.x checkpoints across guidance scales with 50 DDIM steps]

Text-to-Image


Stable Diffusion 2 is a latent diffusion model conditioned on the penultimate text embeddings of an OpenCLIP ViT-H/14 text encoder. We provide a reference script for sampling.
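
As a rough illustration of what "penultimate text embeddings" means, the sketch below grabs the output of the second-to-last transformer block of the OpenCLIP text tower with a forward hook. It assumes the open_clip package and the laion2b_s32b_b79k pretrained tag; the repository's FrozenOpenCLIPEmbedder implements this differently, by iterating over the blocks itself, so treat this purely as a conceptual sketch.

import torch
import open_clip

# Assumption: the ViT-H-14 text tower from open_clip with laion2b_s32b_b79k
# weights; internals can differ between open_clip versions.
model, _, _ = open_clip.create_model_and_transforms("ViT-H-14", pretrained="laion2b_s32b_b79k")

captured = {}

def grab_penultimate(module, inputs, output):
    # Hidden states after the second-to-last transformer block.
    captured["hidden"] = output

handle = model.transformer.resblocks[-2].register_forward_hook(grab_penultimate)

with torch.no_grad():
    tokens = open_clip.tokenize(["a professional photograph of an astronaut riding a horse"])
    _ = model.encode_text(tokens)

handle.remove()
print(captured["hidden"].shape)  # per-token features used as conditioning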

Reference Sampling Script

This script incorporates invisible watermarking of the outputs to help viewers identify the images as machine-generated. We provide configs for the SD2-v (768px) and SD2-base (512px) models.
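
The watermarking relies on the invisible-watermark package; a minimal sketch of the encoding step is shown below. It mirrors what the sampling script does, but the helper name and the "SDV2" payload are written from memory, so double-check against the script you are actually running.

import cv2
import numpy as np
from PIL import Image
from imwatermark import WatermarkEncoder

def put_watermark(img: Image.Image, text: str = "SDV2") -> Image.Image:
    # Embed an invisible watermark into a PIL image (DWT+DCT scheme).
    encoder = WatermarkEncoder()
    encoder.set_watermark("bytes", text.encode("utf-8"))
    bgr = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
    bgr = encoder.encode(bgr, "dwtDct")
    return Image.fromarray(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))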

First, download the weights for SD2.1-v and SD2.1-base.

To sample from the SD2.1-v model, run the following:

python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/768model.ckpt/> --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768  

or try out the Web Demo: Hugging Face Spaces.

To sample from the base model, use

python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/model.ckpt/> --config <path/to/config.yaml/>  

By default, this uses the DDIM sampler, and renders images of size 768x768 (which it was trained on) in 50 steps. Empirically, the v-models can be sampled with higher guidance scales.

Note: The inference configs for all model versions are designed to be used with EMA-only checkpoints. For this reason, use_ema=False is set in the configuration; otherwise the code will try to switch from non-EMA to EMA weights.
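
A quick, illustrative way to see why EMA-only ("pruned") checkpoints need use_ema=False is to inspect the state-dict keys; the checkpoint filename below is only an example.

import torch

# The path below is just an example; point it at whichever checkpoint you use.
ckpt = torch.load("v2-1_768-ema-pruned.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)

ema_keys = [k for k in state_dict if "model_ema" in k]
print(f"{len(ema_keys)} separate EMA keys found")
# An EMA-only (pruned) checkpoint keeps no separate model_ema.* entries to switch
# to, which is why the inference configs set use_ema=False.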

Enable Intel® Extension for PyTorch* optimizations in Text-to-Image script

If you're planning on running Text-to-Image on an Intel® CPU, try to sample an image with TorchScript and Intel® Extension for PyTorch* optimizations. Intel® Extension for PyTorch* extends PyTorch with up-to-date feature optimizations for an extra performance boost on Intel® hardware. It can convert the memory layout of operators to the channels-last format, which is generally beneficial on Intel CPUs, take advantage of the most advanced instruction set available on the machine, optimize operators, and more.
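
For orientation, here is a minimal sketch of what the IPEX path does under the hood (illustrative; in practice the txt2img script wires this up via the --ipex, --bf16 and --torchscript flags):

import torch
import intel_extension_for_pytorch as ipex

def optimize_for_intel_cpu(model: torch.nn.Module, use_bf16: bool = True) -> torch.nn.Module:
    model = model.eval()
    # Channels-last memory layout is generally friendlier to Intel CPUs.
    model = model.to(memory_format=torch.channels_last)
    # Let IPEX fuse operators and pick the best available instruction set;
    # bfloat16 additionally requires hardware support.
    dtype = torch.bfloat16 if use_bf16 else torch.float32
    return ipex.optimize(model, dtype=dtype)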

Prerequisites

Before running the script, make sure you have all the needed libraries installed (the optimizations were checked on Ubuntu 20.04). Install jemalloc, numactl, Intel® OpenMP and Intel® Extension for PyTorch*.

apt-get install numactl libjemalloc-dev
pip install intel-openmp
pip install intel_extension_for_pytorch -f https://software.intel.com/ipex-whl-stable

To sample from the SD2.1-v model with TorchScript+IPEX optimizations, run the following. Remember to specify the desired number of instances you want to run the program on (more).

MALLOC_CONF=oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000 python -m intel_extension_for_pytorch.cpu.launch --ninstance <number of instances> --enable_jemalloc scripts/txt2img.py --prompt "a corgi is playing guitar, oil on canvas" --ckpt <path/to/768model.ckpt/> --config configs/stable-diffusion/intel/v2-inference-v-fp32.yaml  --H 768 --W 768 --precision full --device cpu --torchscript --ipex

To sample from the base model with IPEX optimizations, use

MALLOC_CONF=oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000 python -m intel_extension_for_pytorch.cpu.launch --ninstance <number of instances> --enable_jemalloc scripts/txt2img.py --prompt "a corgi is playing guitar, oil on canvas" --ckpt <path/to/model.ckpt/> --config configs/stable-diffusion/intel/v2-inference-fp32.yaml  --n_samples 1 --n_iter 4 --precision full --device cpu --torchscript --ipex

If you're using a CPU that supports bfloat16, consider sampling from the model with bfloat16 enabled for a performance boost, like so:

# SD2.1-v
MALLOC_CONF=oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000 python -m intel_extension_for_pytorch.cpu.launch --ninstance <number of instances> --enable_jemalloc scripts/txt2img.py --prompt "a corgi is playing guitar, oil on canvas" --ckpt <path/to/768model.ckpt/> --config configs/stable-diffusion/intel/v2-inference-v-bf16.yaml --H 768 --W 768 --precision full --device cpu --torchscript --ipex --bf16
# SD2.1-base
MALLOC_CONF=oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000 python -m intel_extension_for_pytorch.cpu.launch --ninstance <number of instances> --enable_jemalloc scripts/txt2img.py --prompt "a corgi is playing guitar, oil on canvas" --ckpt <path/to/model.ckpt/> --config configs/stable-diffusion/intel/v2-inference-bf16.yaml --precision full --device cpu --torchscript --ipex --bf16

Image Modification with Stable Diffusion


Depth-Conditional Stable Diffusion

To augment the well-established img2img functionality of Stable Diffusion, we provide a shape-preserving stable diffusion model.

Note that the original method for image modification introduces significant semantic changes w.r.t. the initial image. If that is not desired, download our depth-conditional stable diffusion model and the dpt_hybrid MiDaS model weights, place the latter in a folder midas_models and sample via

python scripts/gradio/depth2img.py configs/stable-diffusion/v2-midas-inference.yaml <path-to-ckpt>

or

streamlit run scripts/streamlit/depth2img.py configs/stable-diffusion/v2-midas-inference.yaml <path-to-ckpt>

This method can be used on the samples of the base model itself. For example, take this sample generated by an anonymous discord user. Using the gradio or streamlit script depth2img.py, the MiDaS model first infers a monocular depth estimate given this input, and the diffusion model is then conditioned on the (relative) depth output.
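
For reference, here is a minimal sketch of the depth-estimation step using the MiDaS models published on torch.hub. The repository's scripts instead load the dpt_hybrid weights you placed in midas_models, so this is only meant to illustrate the conditioning signal.

import numpy as np
import torch
from PIL import Image

# Load DPT-Hybrid MiDaS and its matching preprocessing transform from torch.hub.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Hybrid")
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
midas.eval()

img = np.array(Image.open("input.png").convert("RGB"))  # example path
batch = midas_transforms.dpt_transform(img)

with torch.no_grad():
    depth = midas(batch)  # relative (inverse) depth map, shape [1, H, W]

# Roughly speaking, this depth map is resized, normalized and fed to the
# depth-conditioned model as additional conditioning alongside the prompt.
print(depth.shape)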


This model is particularly useful for a photorealistic style; see the examples. For a maximum strength of 1.0, the model removes all pixel-based information and only relies on the text prompt and the inferred monocular depth estimate.


Classic Img2Img

For running the "classic" img2img, use

python scripts/img2img.py --prompt "A fantasy landscape, trending on artstation" --init-img <path-to-img.jpg> --strength 0.8 --ckpt <path/to/model.ckpt>

and adapt the checkpoint and config paths accordingly.
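
The strength parameter controls how far the init image is pushed back into the diffusion process before being denoised again under the prompt. Here is a minimal sketch of that bookkeeping, mirroring the usual t_enc = strength * steps logic; the helper below is illustrative.

def encode_steps(strength: float, ddim_steps: int = 50) -> int:
    # strength in [0, 1]: low values stay close to the init image,
    # 1.0 discards its pixel content entirely.
    assert 0.0 <= strength <= 1.0, "strength must be in [0, 1]"
    return int(strength * ddim_steps)

# With the defaults above, --strength 0.8 noises the init image up to step 40
# of 50 and then denoises it back over those 40 steps under the prompt.
print(encode_steps(0.8))  # 40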

Image Upscaling with Stable Diffusion

After downloading the weights, run

python scripts/gradio/superresolution.py configs/stable-diffusion/x4-upscaling.yaml <path-to-checkpoint>

or

streamlit run scripts/streamlit/superresolution.py -- configs/stable-diffusion/x4-upscaling.yaml <path-to-checkpoint>

for a Gradio or Streamlit demo of the text-guided x4 superresolution model.
This model can be used both on real inputs and on synthesized examples. For the latter, we recommend setting a higher noise_level, e.g. noise_level=100.
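
The noise_level parameter controls how much Gaussian noise is added to the low-resolution conditioning image before upscaling (noise augmentation). Here is a minimal sketch of that forward-diffusion step, assuming a precomputed alphas_cumprod schedule; the schedule values below are made up for illustration.

import torch

def augment_low_res(lr_image: torch.Tensor, noise_level: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    # Standard forward-diffusion noising at timestep `noise_level`: higher values
    # hide more of the (possibly imperfect) low-res detail, which helps when the
    # input is itself a synthesized image.
    a = alphas_cumprod[noise_level]
    noise = torch.randn_like(lr_image)
    return a.sqrt() * lr_image + (1.0 - a).sqrt() * noise

# Illustrative linear-beta schedule with 1000 steps; noise_level=100 keeps most
# structure but softens fine detail.
betas = torch.linspace(1e-4, 2e-2, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
noisy = augment_low_res(torch.zeros(1, 3, 128, 128), 100, alphas_cumprod)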

Image Inpainting with Stable Diffusion


Download the SD 2.0-inpainting checkpoint and run

python scripts/gradio/inpainting.py configs/stable-diffusion/v2-inpainting-inference.yaml <path-to-checkpoint>

or

streamlit run scripts/streamlit/inpainting.py -- configs/stable-diffusion/v2-inpainting-inference.yaml <path-to-checkpoint>

for a Gradio or Streamlit demo of the inpainting model. This script adds invisible watermarking to the demo in the RunwayML repository, but both should work interchangeably with the checkpoints/configs.

Shout-Outs

License

The code in this repository is released under the MIT License.

The weights are available via the StabilityAI organization at Hugging Face, and released under the CreativeML Open RAIL++-M License.

BibTeX

@misc{rombach2021highresolution,
      title={High-Resolution Image Synthesis with Latent Diffusion Models}, 
      author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
      year={2021},
      eprint={2112.10752},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}


stablediffusion's Issues

Inpainting Masking

Hi, I have a question that may seem a bit obvious, but I would like some clarification.

In the ddim_sampling here: https://github.com/Stability-AI/stablediffusion/blob/main/ldm/models/diffusion/ddim.py#L157

you have

img = img_orig * mask + (1. - mask) * img

Why isn't it

img = img_orig * (1 - mask) + mask * img

instead?

By multiplying the original image by the mask (img_orig * mask), doesn't that mean the new iterate of the image stays the same as the original, since the masked portion is replaced by the original image? I thought the latter would make more sense: the mask should multiply the new iterate of the image, which gets updated at each iteration.
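
For what it's worth, the two formulas only differ in which mask value is treated as "keep the original"; the tiny numeric check below makes that explicit. It is illustrative only and not a statement about which convention the callers of ddim_sampling actually use.

import torch

img_orig = torch.tensor([10.0])  # known content from the original image
img      = torch.tensor([99.0])  # current denoising iterate
mask     = torch.tensor([1.0])   # 1.0 here marks the region taken from img_orig

# Formula in the repo: keeps img_orig wherever mask == 1, keeps the iterate elsewhere.
print(img_orig * mask + (1.0 - mask) * img)  # tensor([10.])

# Proposed formula: equivalent behaviour only if mask == 0 marks the kept region.
print(img_orig * (1 - mask) + mask * img)    # tensor([99.])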

which torchtext version?

With torchtext 0.14.0 I got

Traceback (most recent call last):
  File "app.py", line 12, in <module>
    from pytorch_lightning import seed_everything
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning import metrics  # noqa: E402
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/__init__.py", line 15, in <module>
    from pytorch_lightning.metrics.classification import (  # noqa: F401
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/__init__.py", line 14, in <module>
    from pytorch_lightning.metrics.classification.accuracy import Accuracy  # noqa: F401
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/accuracy.py", line 18, in <module>
    from pytorch_lightning.metrics.utils import deprecated_metrics, void
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/utils.py", line 29, in <module>
    from pytorch_lightning.utilities import rank_zero_deprecation
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/__init__.py", line 18, in <module>
    from pytorch_lightning.utilities.apply_func import move_data_to_device  # noqa: F401
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 31, in <module>
    from torchtext.legacy.data import Batch
ModuleNotFoundError: No module named 'torchtext.legacy'

ModuleNotFoundError: No module named 'imwatermark'

Traceback (most recent call last):
  File "scripts/txt2img.py", line 14, in <module>
    from imwatermark import WatermarkEncoder
ModuleNotFoundError: No module named 'imwatermark'

Unintuitively, this is not solved by installing imWatermark, but by:

%pip install invisible-watermark

What is the minimum RAM required for running V2?

Hey,

Thank you for releasing this great model and making this publicly available.

We are able to run the 1.5 model on an EC2 G5XL instance with 16 GB RAM. However, when trying to deploy V2 on the same instance, the process is killed. I tried deploying on an instance with more RAM and it deployed fine.
So my questions are: is the minimum memory requirement higher for V2? What is the minimum RAM required? Is there a recommended EC2 instance for running the model?

requirements issue

I created a new conda env for this and installed the basic requirements, but I am stuck on the pip install -e . step, receiving the following error:

does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.

When will the Image Upscaling with Stable Diffusion function be available?

Hi Stability-AI team😀

I have a question.
I would appreciate it if you could answer.

I want to use the Image Upscaling with Stable Diffusion function.
However, when I checked the link below, it still didn't seem to work.
https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler

It said "Use it with diffusers (coming soon)".

when will it be available?
And if there are any other demo pages that can use the upscaling function, please let me know.

Thank you.

Process is killed when I run txt2img.py

Hi, thanks for your great work!
When I run txt2img.py, it breaks and gets killed. Do you know what the problem is?

DiffusionWrapper has 865.91 M params.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Killed

no such file or directory: /usr/src/app/.cache/huggingface/hub/models--laion--CLIP-ViT-H-14-laion2B-s32B-b79K/refs/main

I downloaded open_clip_pytorch_model.bin from Hugging Face and saved it in ./laion/CLIP-ViT-H-14-laion2B-s32B-b79K, but got the error: no such file or directory: /usr/src/app/.cache/huggingface/hub/models--laion--CLIP-ViT-H-14-laion2B-s32B-b79K/refs/main.

I found that in factory.py of the open_clip lib, the model is downloaded at line 156 without checking whether it already exists locally. Can anyone help me?

Text 2 Mask integration?

I've seen txt2mask capability integrated elsewhere into the web application for SD 1.
I was wondering if we could get something similar integrated into SD2, but without having to use the web application, so we can use the low-level scripts. If people want to build on top of those, that would be great, but in the interest of compatibility a txt2mask script should be written with native SD in mind.

My use case is that I have a Discord bot that takes my prompts; I would find it useful to provide a mask prompt, which could then be used as part of the inpainting. Having the txt2mask functionality in a web app is not useful to me, since I am still executing the Python scripts that come with SD via a pipeline.

Wrong argument in txt2img.py

When running txt2img.py with the "--repeat" argument, data = [p for p in data for i in range(opt.repeat)] should be data = [p for p in data for i in range(batch_size)], since we want the prompt repeated exactly batch_size times before being wrapped up with the chunk() function. Otherwise, a shape issue comes up later on.
I have already submitted PR 55 to potentially fix this.

windows xformers install breaking

Hi, I am having issues installing xformers; it breaks at pip install -e .

Obtaining file:///C:/Users/heart/Desktop/DEV/xformers
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [18 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "C:\Users\heart\Desktop\DEV\xformers\setup.py", line 270, in <module>
          ext_modules=get_extensions(),
        File "C:\Users\heart\Desktop\DEV\xformers\setup.py", line 210, in get_extensions
          cuda_version = get_cuda_version(CUDA_HOME)
        File "C:\Users\heart\Desktop\DEV\xformers\setup.py", line 67, in get_cuda_version
          raw_output = subprocess.check_output([nvcc_bin, "-V"], universal_newlines=True)
        File "C:\Users\heart\anaconda3\envs\stability\lib\subprocess.py", line 421, in check_output
          return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
        File "C:\Users\heart\anaconda3\envs\stability\lib\subprocess.py", line 503, in run
          with Popen(*popenargs, **kwargs) as process:
        File "C:\Users\heart\anaconda3\envs\stability\lib\subprocess.py", line 971, in __init__
          self._execute_child(args, executable, preexec_fn, close_fds,
        File "C:\Users\heart\anaconda3\envs\stability\lib\subprocess.py", line 1440, in _execute_child
          hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
      FileNotFoundError: [WinError 2] The system cannot find the file specified
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

How can I reproduce your depth2img samples?

Hello,

I'd like to reproduce your sample results for old_man.png with the depth2img checkpoint.

I tried several runs but couldn't generate good images.

I think starting from the settings used for your samples would help to produce high-quality results.

May I ask for the parameters used for 'assets/stable-samples/depth2img'?

x4-upscaler encoder diverges for any input, because of half precision?

Hi,
I'm playing around with the x4-upscaler model and I'm currently trying to pass an image into its encoder, but it keeps diverging for no reason, resulting in nan values. I tried doing some debugging and it gives me the following variables once I reach line 534 in "./ldm/modules/diffusionmodules/model.py":
image
hs is a list of the consecutive outputs of the res/downscale blocks, and we clearly see them diverging to nan values.
x is a tensor filled with zeros in this example, but I've tried with actual images and get roughly the same divergence process.

My first intuition was that the image processing I did wasn't right, but after double checking I am doing just as the inpainting script, eg scaling between -1 and 1, and casting to torch.float32 and 'cuda' device...

My second intuition was that maybe the encoder weights weren't given and thus were random. (?) But checking into the model's checkpoint they are here.

My last intuition is the fact that the tensors in hs are torch.float16, which is highly surprising to me given that mixed precision completely breaks the decoder and that it is thus not enabled.
Any clue on how and why they would be half precision ? on whether on not this could be the source of my problem ? and on my problem overall ?

Thanks in advance for any time you put into answering !

Installation seems unclear

What is a .ckpt file? There don't seem to be any in the repo, but one is used in the example. Are we supposed to download these, or are they available under a different name?

Could the full command line

python scripts/img2img.py --prompt "A fantasy landscape, trending on artstation" --init-img <path-to-img.jpg> --strength 0.8 --ckpt <path/to/model.ckpt>

be simplified so that model.ckpt is already in the repo or else downloaded automatically?

Any plans to publish the training code?

Hi Stability-AI team,
Thank you for your outstanding work!

I looked at the code base and only found the inference/sampling code; are there any plans to publish the training code?

Thanks again!

Watermark bias

I don't know if this is a good place to report such problems, but it seems that the network is overtrained on images containing watermarks. I'm posting an example where it imprinted a clearly recognizable dreamstime.com watermark (image attached).

the zero shot FID for the Stable Diffusion

In the NVIDIA paper "eDiff-I", it is said that the zero-shot FID of Stable Diffusion on the COCO 2014 validation set can be 8.59.

But in the paper "Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces", the reimplemented result is 25.40.

Can you tell me the official result for the zero-shot COCO 2014 validation set?

Streamlit SD-Upscale x4, CUDA out of memory. Tried to allocate 400.00 GiB

Normally a CUDA OOM is expected with smaller GPUs, but... 400 GiB? No GPU that large exists, so this is obviously a bug.
512x512 input. It goes through every DDIM step before the kaboom.
Using a conda env made with the environment yaml. Running on a 4090 machine.
full log:
Traceback (most recent call last):
  File "c:\users\------\miniconda3\envs\ldm\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 556, in _run_script
    exec(code, module.__dict__)
  File "Z:\SD\SD_2.0\stablediffusion\scripts\streamlit\superresolution.py", line 170, in <module>
    run()
  File "Z:\SD\SD_2.0\stablediffusion\scripts\streamlit\superresolution.py", line 152, in run
    result = paint(
  File "Z:\SD\SD_2.0\stablediffusion\scripts\streamlit\superresolution.py", line 109, in paint
    x_samples_ddim = model.decode_first_stage(samples)
  File "c:\users\------\miniconda3\envs\ldm\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "z:\sd\sd_2.0\stablediffusion\ldm\models\diffusion\ddpm.py", line 826, in decode_first_stage
    return self.first_stage_model.decode(z)
  File "z:\sd\sd_2.0\stablediffusion\ldm\models\autoencoder.py", line 90, in decode
    dec = self.decoder(z)
  File "c:\users\------\miniconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "z:\sd\sd_2.0\stablediffusion\ldm\modules\diffusionmodules\model.py", line 631, in forward
    h = self.mid.attn_1(h)
  File "c:\users\------\miniconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "z:\sd\sd_2.0\stablediffusion\ldm\modules\diffusionmodules\model.py", line 191, in forward
    w_ = torch.bmm(q,k)     # b,hw,hw    w[b,i,j]=sum_c q[b,i,c]k[b,c,j]
RuntimeError: CUDA out of memory. Tried to allocate 400.00 GiB (GPU 0; 23.99 GiB total capacity; 6.47 GiB already allocated; 0 bytes free; 17.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
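
A rough back-of-envelope (not an official diagnosis) for why the requested allocation can get that large: the vanilla attention in the autoencoder's mid-block materializes a full (h*w) x (h*w) similarity matrix, so its memory grows with the fourth power of the feature-map resolution.

def attn_matrix_gib(h: int, w: int, bytes_per_element: int = 4, batch: int = 1) -> float:
    # Memory for the w_ = torch.bmm(q, k) result alone: batch * (h*w)^2 elements.
    return batch * (h * w) ** 2 * bytes_per_element / 2**30

print(attn_matrix_gib(64, 64))    # ~0.06 GiB: harmless at a 64x64 feature map
print(attn_matrix_gib(512, 512))  # ~256 GiB: explodes at a 512x512 feature map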

Xformer performance decrease

After compiling xformers I have seen a memory usage improvement, but performance actually decreased from ~15 it/s to ~12 it/s (512x512). Specs: RTX 3090 Ti plus Intel Xeon Gold 5220R. Just to inform :D

PLMS sampling is broken

Using the 768 v-diffusion model, using prompt "fruit basket".

With DDIM sampling:
(image attached)

With PLMS sampling:
(image attached)

CUDA out of memory? (3080ti 12GB)

I just installed Stable Diffusion 2.0 on my Linux box and it's sort of working.

I keep getting "CUDA out of memory" errors.
When using the txt2img example I had to decrease the resolution to 384x384 to avoid a crash.

With the x4 upscaler web interface I always end with a crash like:
CUDA out of memory. Tried to allocate 2.81 GiB (GPU 0; 11.77 GiB total capacity; 7.84 GiB already allocated...

I tried setting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64 (and other values), which allowed me to increase the txt2img resolution to 512x512, but the x4 upscaler still crashes.

Is this normal behavior or do I simply need more VRAM?

My machine is 5900X, 32GB RAM, 3080ti 12GB, Pop!_OS 22.04 LTS.

[Bug] The "decode_first_stage" function in DDPM does not respect the "force_not_quantize" parameter

The decode_first_stage function in the ldm/models/diffusion/ddpm.py file looks like this.

    def decode_first_stage(self, z, predict_cids=False, force_not_quantize=False):
        if predict_cids:
            if z.dim() == 4:
                z = torch.argmax(z.exp(), dim=1).long()
            z = self.first_stage_model.quantize.get_codebook_entry(z, shape=None)
            z = rearrange(z, 'b h w c -> b c h w').contiguous()

        z = 1. / self.scale_factor * z
        return self.first_stage_model.decode(z)

The predict_cids and force_not_quantize parameters are accepted but never used.

The last line in the old repo looks like this, which makes more sense:
return self.first_stage_model.decode(z, force_not_quantize=predict_cids or force_not_quantize)

So the question is: will applying this change break anything?

ModuleNotFoundError: No module named 'torchtext.legacy'

I walked through the README and got this. I didn't use conda to install PyTorch, though; I might try that instead.

!python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt models/ldm/768-v-ema.ckpt --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768 
Traceback (most recent call last):
  File "scripts/txt2img.py", line 11, in <module>
    from pytorch_lightning import seed_everything
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning import metrics  # noqa: E402
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/metrics/__init__.py", line 15, in <module>
    from pytorch_lightning.metrics.classification import (  # noqa: F401
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/metrics/classification/__init__.py", line 14, in <module>
    from pytorch_lightning.metrics.classification.accuracy import Accuracy  # noqa: F401
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/metrics/classification/accuracy.py", line 18, in <module>
    from pytorch_lightning.metrics.utils import deprecated_metrics, void
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/metrics/utils.py", line 29, in <module>
    from pytorch_lightning.utilities import rank_zero_deprecation
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/__init__.py", line 18, in <module>
    from pytorch_lightning.utilities.apply_func import move_data_to_device  # noqa: F401
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/apply_func.py", line 31, in <module>
    from torchtext.legacy.data import Batch
ModuleNotFoundError: No module named 'torchtext.legacy'

https://colab.research.google.com/drive/10jKS9pAB2bdN3SHekZzoKzm4jo2F4W1Q?usp=sharing

Should call out the change in UNet model's attention heads

It is well known that in SD2 the text encoder changed and downstream developers should take notice and swap it out. But it is less well known that the UNet model has changed as well. In particular, this line caused most of the trouble and can explain why a lot of people have problems running the base model with their old code:

https://github.com/Stability-AI/stablediffusion/blob/main/configs/stable-diffusion/v2-inference.yaml#L32

Since most implementations (of SDv1 models) realize multi-head attention as a single matrix multiplication across all heads, the weights are unchanged and scripts can take the SDv2 weights as-is.

However, because the config now fixes the number of head channels rather than the number of heads, ports of Stable Diffusion to other platforms will generate garbage values unless their network configuration is changed accordingly.

I saw a few mentions on HN of people who cannot get the 512 base model to work, and want to call it out here.
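
A minimal sketch of the difference (the names follow the config keys; the helper itself is illustrative):

def heads_for(channels: int, num_heads: int = None, num_head_channels: int = None) -> int:
    # SD 1.x configs fix num_heads (8), so the per-head dimension grows with channels.
    # SD 2.x configs fix num_head_channels (64), so the head count grows instead.
    if num_head_channels is not None:
        return channels // num_head_channels
    return num_heads

for ch in (320, 640, 1280):
    print(ch, heads_for(ch, num_heads=8), heads_for(ch, num_head_channels=64))
# 320  ->  8 heads vs  5 heads
# 640  ->  8 heads vs 10 heads
# 1280 ->  8 heads vs 20 heads
# The projection weights are identical either way; only the reshape into heads
# changes, which is why ports that hard-code 8 heads produce garbage with SD 2.x.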

ViT-H-14 Model Config missing

The model config for ViT-H-14 is not found. Probably easily solvable. First issue though 😁

LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 865.91 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
ERROR:root:Model config for ViT-H-14 not found; available models ['RN50', 'RN50-quickgelu', 'RN50x4', 'RN50x16', 'RN101', 'RN101-quickgelu', 'timm-efficientnetv2_rw_s', 'timm-resnet50d', 'timm-resnetaa50d', 'timm-resnetblur50', 'timm-swin_base_patch4_window7_224', 'timm-vit_base_patch16_224', 'timm-vit_base_patch32_224', 'timm-vit_small_patch16_224', 'ViT-B-16', 'ViT-B-32', 'ViT-B-32-quickgelu', 'ViT-L-14'].
Traceback (most recent call last):
  File "txt2img.py", line 290, in <module>
    main(opt)
  File "txt2img.py", line 191, in main
    model = load_model_from_config(config, f"{opt.ckpt}")
  File "txt2img.py", line 35, in load_model_from_config
    model = instantiate_from_config(config.model)
  File "/app/stablediffusion/ldm/util.py", line 79, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
  File "/app/stablediffusion/ldm/models/diffusion/ddpm.py", line 563, in __init__
    self.instantiate_cond_stage(cond_stage_config)
  File "/app/stablediffusion/ldm/models/diffusion/ddpm.py", line 630, in instantiate_cond_stage
    model = instantiate_from_config(config)
  File "/app/stablediffusion/ldm/util.py", line 79, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
  File "/app/stablediffusion/ldm/modules/encoders/modules.py", line 147, in __init__
    model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'), pretrained=version)
  File "/opt/conda/lib/python3.8/site-packages/open_clip/factory.py", line 133, in create_model_and_transforms
    model = create_model(
  File "/opt/conda/lib/python3.8/site-packages/open_clip/factory.py", line 83, in create_model
    raise RuntimeError(f'Model config for {model_name} not found.')
RuntimeError: Model config for ViT-H-14 not found.

Fine tuning the model

Hello! First, thanks a lot for all your work!
Quick question: I tried to fine-tune v2.0 of the model on new images using the same scripts I was using for v1.4/v1.5 (DreamBooth and textual inversion), but the results are very bad (almost only noise).
1/ Is this normal? What is different about the model architecture / training that makes the training scripts not work well with v2.0?
2/ What should I look into to adapt the fine-tuning scripts to work with v2.0?

Thanks a lot for your answers!

DPM Solver doesn't support FP16 mode

I'm running on an 8 GB card (1070), so I have limited VRAM.

One of the changes I make is to use FP16 (half precision) to use less VRAM; for example, in txt2img.py I have modified the code like this:

seed_everything(opt.seed)

torch.set_default_tensor_type(torch.HalfTensor)

config = OmegaConf.load(f"{opt.config}")
model = load_model_from_config(config, f"{opt.ckpt}")
model = model.half()

the set_default... and model.half() lines are the additions.

This has generally worked okay, except when trying to use --dpm.

With --dpm, there is an error that comes from dpm_solver.py and its use of torch.linspace:

torch.linspace isn't supported for FP16 when running on the CPU (which is where that part of the DPM solver runs).

As a workaround I have modified those lines to specify a dtype of float, like this:

self.t_array = torch.linspace(0., 1., self.total_N + 1,dtype=torch.float)[1:].reshape((1, -1))

This seems to work okay, but I don't know if it is the best fix.

It would be good to get an official fix in the repo for this, and also a command line option to officially support using FP16.

Thanks

Still using invisible watermark package that rarely works?

I'm surprised version 2 is still using the https://github.com/ShieldMnt/invisible-watermark package, which rarely works. The encoding scheme is just not robust at all. The GitHub repo itself has years-old issues from people complaining that it doesn't work, which remain unaddressed and unresolved.

Either excise the package completely, since it is almost completely useless, or write your own watermarking code. Maybe something more robust like Hamming codes + pixel rounding? (Or you could at least fix the above package so it solves recursively, forcing it to properly decode the watermark.)

.half call to fix?

I'm sorry that I cannot provide my Python traceback log (I already reverted my code, sadly).

While trying to update my previous code based on SD v1, I hit the error message:
Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same (conv2d)

Since I haven't modified the core part of the model, I thought this error came from a compatibility issue with my custom code.

However, I couldn't find where the HalfTensor came from. A while after reverting my code, I found this line in the SD v2 code:
return checkpoint(self._forward, (x,), self.parameters(), True)  # TODO: check checkpoint usage, is True # TODO: fix the .half call!!!

I am wondering where the ".half call" to fix is; it might help my second attempt at the patch. 👍

Tokenizer in OpenCLIP seems to be appending 0s rather than eot tokens to the specified length

Previously, stablediffusion used CLIP's tokenizer, which pads with eot tokens up to the specified length (77). It seems the newer one (at least the one used in txt2img.py) uses SimpleTokenizer and pads with 0 up to the specified length: https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/tokenizer.py#L183

I'm not sure what the implication for the training process would be. I also checked the vocab: 0 maps to ! rather than any special token such as <start_of_text> or <end_of_text>.
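
A quick way to inspect the padding behaviour, assuming the open_clip package is installed (the printed values depend on the installed version):

import open_clip

tokens = open_clip.tokenize(["a corgi is playing guitar"])  # LongTensor of shape [1, 77]
row = tokens[0].tolist()

print(row[:8])   # <start_of_text>, the prompt tokens, <end_of_text>, ...
print(row[-5:])  # ... then plain zeros up to length 77
# The tokenizer used for SD 1.x pads with the <end_of_text> token instead of 0,
# which is the discrepancy described above.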

[Feature Request]: img2txt - Image to text ?

With current technology, would it be possible to ask the AI to generate text from an image? The goal is to know how the technology would describe the image: a tool for the AI to describe the image for us.

Community Integration: Making AIGC cheaper, faster, and more efficient

Thank you for your rapid and outstanding contribution to Stable Diffusion 2.0!
AIGC has recently risen to be one of the hottest topics in AI. Unfortunately, large hardware requirements and training costs are still a severe impediment to the rapid growth of the AIGC industry. The Stable Diffusion v1 version of the model requires 150,000 A100 GPU Hours for a single training session.

We are happy to share a fantastic solution where the costs of training AIGC models such as stable diffusion can be 7 times cheaper!

Colossal-AI released a complete open-source Stable Diffusion pretraining and fine-tuning solution with the pretraining cost reduced by 6.5 times, and the hardware cost of fine-tuning by 7 times. An RTX 2070/3050 PC is good enough to accomplish the fine-tuning task flow, allowing AIGC models such as Stable Diffusion to be available to a wider community.

Open-source code: https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion

More details can be found on the blog. We are also very happy to provide such improvements for Stable Diffusion 2.0, and believe the democratization of AIGC models is also very helpful for Stable Diffusion 2.0 users. We would appreciate it if we could build the integration with you to benefit both of our users, and we are willing to provide help you need in this cooperation for free.

Thank you very much.

Best regards,
Yongbin Li, HPC-AI Tech

finger problem

Maybe the finger problem could be solved by adding the reverse use of pose estimation and object detection models.

depth2img mode with mask?

Is it possible to use depth2img with an image mask (i.e., in-painting)?

I'm trying to work my way through the scripts themselves and am still trying to grok what exactly is going on in txt2img, img2img, and depth2img. What is the best resource for understanding the architecture of Stable Diffusion, particularly as implemented here?

Linux Conda Xformers Install Issue

Everything goes fine during installation except xformers. When I run pip install -e ., this error occurs:

The detected CUDA version (11.4) mismatches the version that was used to compile
    PyTorch (10.2). Please make sure to use the same CUDA versions.
