
🚀 Stable Fast


stable-fast achieves SOTA inference performance on ALL kinds of diffusers models, even the latest StableVideoDiffusionPipeline. And unlike TensorRT or AITemplate, which take dozens of minutes to compile a model, stable-fast compiles a model in only a few seconds. stable-fast also supports dynamic shape, LoRA and ControlNet out of the box.

| Model | torch | torch.compile | AIT | oneflow | TensorRT | stable-fast |
| --- | --- | --- | --- | --- | --- | --- |
| SD 1.5 (ms) | 1897 | 1510 | 1158 | 1003 | 991 | 995 |
| SVD-XT (s) | 83 | 70 | | | | 47 |

NOTE: During benchmarking, TensorRT is tested with a static batch size and CUDA Graph enabled, while stable-fast is running with dynamic shape.

Introduction

What is this?

stable-fast is an ultra lightweight inference optimization framework for HuggingFace Diffusers on NVIDIA GPUs. stable-fast provides super fast inference optimization by utilizing some key techniques and features:

  • CUDNN Convolution Fusion: stable-fast implements a series of fully-functional and fully-compatible CUDNN convolution fusion operators for all kinds of combinations of Conv + Bias + Add + Act computation patterns.
  • Low Precision & Fused GEMM: stable-fast implements a series of fused GEMM operators that compute with fp16 precision, which is faster than PyTorch's default (read & write in fp16 while computing in fp32).
  • Fused Linear GEGLU: stable-fast is able to fuse GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c) into one CUDA kernel (see the reference sketch after this list).
  • NHWC & Fused GroupNorm: stable-fast implements a highly optimized fused NHWC GroupNorm + SiLU operator with OpenAI's Triton, which eliminates the need for memory format permutation operators.
  • Fully Traced Model: stable-fast improves the torch.jit.trace interface to make it better suited to tracing complex models. Nearly every part of StableDiffusionPipeline/StableVideoDiffusionPipeline can be traced and converted to TorchScript. It is more stable than torch.compile, has significantly lower CPU overhead, and supports ControlNet and LoRA.
  • CUDA Graph: stable-fast can capture the UNet, VAE and TextEncoder into CUDA Graph format, which reduces CPU overhead when the batch size is small. This implementation also supports dynamic shape.
  • Fused Multihead Attention: stable-fast simply uses xformers and makes it compatible with TorchScript.
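
For clarity, here is a minimal eager-mode PyTorch sketch of the GEGLU pattern that the fused kernel replaces (the function name and shapes are illustrative, not part of stable-fast's API):

import torch
import torch.nn.functional as F

def geglu_reference(x, W, V, b, c):
    # GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c), computed eagerly.
    # stable-fast fuses this whole pattern into a single CUDA kernel.
    # Shapes: x is (batch, d_in); W, V are (d_in, d_out); b, c are (d_out,).
    return F.gelu(x @ W + b) * (x @ V + c)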

My next goal is to keep stable-fast one of the fastest inference optimization frameworks for diffusers while also providing both speedup and VRAM reduction for transformers. In fact, I already use stable-fast to optimize LLMs and achieve a significant speedup. But some work remains to make it more stable and easier to use, and to provide a stable user interface.

Differences With Other Acceleration Libraries

  • Fast: stable-fast is specially optimized for HuggingFace Diffusers. It achieves high performance across many library versions and compiles a model within only a few seconds, significantly faster than torch.compile, TensorRT and AITemplate.
  • Minimal: stable-fast works as a plugin framework for PyTorch. It utilizes existing PyTorch functionality and infrastructure and is compatible with other acceleration techniques, as well as popular fine-tuning techniques and deployment solutions.
  • Maximum Compatibility: stable-fast is compatible with all kinds of HuggingFace Diffusers and PyTorch versions. It is also compatible with ControlNet and LoRA, and it even supports the latest StableVideoDiffusionPipeline out of the box!

Installation

NOTE: stable-fast is currently only tested on Linux and WSL2 on Windows. You need to install PyTorch with CUDA support first (versions from 1.12 to 2.1 are suggested).

I only test stable-fast with torch>=2.1.0, xformers>=0.0.22 and triton>=2.1.0 on CUDA 12.1 and Python 3.10. Other versions might build and run successfully but that's not guaranteed.

Install Prebuilt Wheels

Download the wheel corresponding to your system from the Releases Page and install it with pip3 install <wheel file>.

Currently both Linux and Windows wheels are available.

# Change cu121 to your CUDA version and <wheel file> to the path of the wheel file.
# And make sure the wheel file is compatible with your PyTorch version.
pip3 install --index-url https://download.pytorch.org/whl/cu121 \
    'torch>=2.1.0' 'xformers>=0.0.22' 'triton>=2.1.0' 'diffusers>=0.19.3' \
    '<wheel file>'
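
After installing a wheel, a quick import check helps confirm it matches your environment; importing sfast loads the compiled extension, so a mismatched wheel fails right here (a minimal sketch):

import torch
import sfast  # loads the compiled sfast._C extension on import

print(torch.__version__, torch.version.cuda)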

Install From Source

# Make sure you have CUDNN/CUBLAS installed.
# https://developer.nvidia.com/cudnn
# https://developer.nvidia.com/cublas

# Install PyTorch with CUDA and other packages at first.
# Windows users: Triton might not be available; you can skip it.
# NOTE: 'wheel' is required, or you will hit a `No module named 'torch'` error when building.
pip3 install wheel 'torch>=2.1.0' 'xformers>=0.0.22' 'triton>=2.1.0' 'diffusers>=0.19.3'

# (Optional) Makes the build much faster.
pip3 install ninja

# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types.
# You can also install the latest stable release from PyPI.
# pip3 install -v -U stable-fast
pip3 install -v -U git+https://github.com/chengzeyi/stable-fast.git@main#egg=stable-fast
# (this can take dozens of minutes)

NOTE: Any usage outside sfast.compilers is not guaranteed to be backward compatible.

NOTE: To get the best performance, xformers and OpenAI's triton>=2.1.0 need to be installed and enabled. You might need to build xformers from source to make it compatible with your PyTorch.

Usage

Optimize StableDiffusionPipeline

stable-fast is able to optimize StableDiffusionPipeline and StableDiffusionXLPipeline directly.

import time
import torch
from diffusers import (StableDiffusionPipeline,
                       EulerAncestralDiscreteScheduler)
from sfast.compilers.diffusion_pipeline_compiler import (compile,
                                                         CompilationConfig)

def load_model():
    model = StableDiffusionPipeline.from_pretrained(
        'runwayml/stable-diffusion-v1-5',
        torch_dtype=torch.float16)

    model.scheduler = EulerAncestralDiscreteScheduler.from_config(
        model.scheduler.config)
    model.safety_checker = None
    model.to(torch.device('cuda'))
    return model

model = load_model()

config = CompilationConfig.Default()
# xformers and Triton are suggested for achieving best performance.
try:
    import xformers
    config.enable_xformers = True
except ImportError:
    print('xformers not installed, skip')
try:
    import triton
    config.enable_triton = True
except ImportError:
    print('Triton not installed, skip')
# CUDA Graph is suggested for small batch sizes and small resolutions to reduce CPU overhead.
# But it can increase the amount of GPU memory used.
# For StableVideoDiffusionPipeline it is not needed.
config.enable_cuda_graph = True

model = compile(model, config)

kwarg_inputs = dict(
    prompt=
    '(masterpiece:1.2), best quality, masterpiece, best detailed face, a beautiful girl',
    height=512,
    width=512,
    num_inference_steps=30,
    num_images_per_prompt=1,
)

# NOTE: Warm it up.
# The initial calls will trigger compilation and might be very slow.
# After that, it should be very fast.
for _ in range(3):
    output_image = model(**kwarg_inputs).images[0]

# Let's see it!
# Note: Progress bar might work incorrectly due to the async nature of CUDA.
begin = time.time()
output_image = model(**kwarg_inputs).images[0]
print(f'Inference time: {time.time() - begin:.3f}s')

# Let's view it in terminal!
from sfast.utils.term_image import print_image

print_image(output_image, max_width=80)

Refer to examples/optimize_stable_diffusion_pipeline.py for more details.

You can check this Colab to see how it works on a T4 GPU.
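
The SDXL flow is the same apart from the pipeline class; a minimal sketch, assuming the standard SDXL base checkpoint (the model id is an assumption):

import torch
from diffusers import StableDiffusionXLPipeline

xl_model = StableDiffusionXLPipeline.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    torch_dtype=torch.float16).to(torch.device('cuda'))
xl_model = compile(xl_model, config)  # same compile() and config as above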

Optimize LCM Pipeline

stable-fast is able to optimize the newest latent consistency model pipeline and achieve a significant speedup.

Refer to examples/optimize_lcm_lora.py for more details about how to optimize a normal SD model with LCM LoRA. Refer to examples/optimize_lcm_pipeline.py for more details about how to optimize the standalone LCM model.
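
As a rough sketch, the LCM-LoRA flow applies the scheduler swap and LoRA before compiling a freshly loaded pipeline (the scheduler choice and LoRA id follow the standard diffusers LCM recipe and are assumptions here):

from diffusers import LCMScheduler

model = load_model()  # as defined in the example above
model.scheduler = LCMScheduler.from_config(model.scheduler.config)
model.load_lora_weights('latent-consistency/lcm-lora-sdv1-5')
model = compile(model, config)  # compile after the LoRA is applied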

Optimize StableVideoDiffusionPipeline

stable-fast is able to optimize the newest StableVideoDiffusionPipeline and achieve a 2x speedup.

Refer to examples/optimize_stable_video_diffusion_pipeline.py for more details.
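
A minimal sketch of that, assuming the same compile() entry point (the model id is an assumption); note the earlier comment that CUDA Graph is not needed for StableVideoDiffusionPipeline:

import torch
from diffusers import StableVideoDiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import compile, CompilationConfig

svd = StableVideoDiffusionPipeline.from_pretrained(
    'stabilityai/stable-video-diffusion-img2vid-xt',
    torch_dtype=torch.float16).to(torch.device('cuda'))
svd_config = CompilationConfig.Default()
svd_config.enable_cuda_graph = False  # per the note above, not needed for SVD
svd = compile(svd, svd_config)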

Dynamically Switch LoRA

Switching LoRA dynamically is supported, but you need to do some extra work. This is possible because the compiled graph and CUDA Graph share the same underlying data (pointers) with the original UNet model. So all you need to do is update the original UNet model's parameters in place.

The following code assumes you have already loaded a LoRA and compiled the model, and you want to switch to another LoRA.

If you don't enable CUDA Graph and keep preserve_parameters = True, things are much easier; the following code might not even be needed.

# load_state_dict with assign=True requires torch >= 2.1.0

def update_state_dict(dst, src):
    for key, value in src.items():
        # Do inplace copy.
        # As the traced forward function shares the same underlying data (pointers),
        # this modification will be reflected in the traced forward function.
        dst[key].copy_(value)

# Switch "another" LoRA into UNet
def switch_lora(unet, lora):
    # Store the original UNet parameters
    state_dict = unet.state_dict()
    # Load another LoRA into unet
    unet.load_attn_procs(lora)
    # Inplace copy current UNet parameters to the original unet parameters
    update_state_dict(state_dict, unet.state_dict())
    # Load the original UNet parameters back.
    # We use assign=True because we still want to hold the references
    # of the original UNet parameters
    unet.load_state_dict(state_dict, assign=True)

switch_lora(compiled_model.unet, lora_b_path)
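
Putting it together, a hypothetical end-to-end flow (the LoRA paths are placeholders):

model = load_model()
model.unet.load_attn_procs('/path/to/lora_a')  # apply the first LoRA
model = compile(model, config)                 # compile with LoRA A applied
# ... generate with LoRA A ...
switch_lora(model.unet, '/path/to/lora_b')     # swap in LoRA B in place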

Model Quantization

stable-fast extends PyTorch's quantize_dynamic functionality and provides a dynamically quantized linear operator on the CUDA backend. Enabling it can give a slight VRAM reduction for diffusers, a significant VRAM reduction for transformers, and a potential (but not guaranteed) speedup.

For SD XL, expect a VRAM reduction of about 2GB at an image size of 1024x1024.

def quantize_unet(m):
    from diffusers.utils import USE_PEFT_BACKEND
    assert USE_PEFT_BACKEND
    m = torch.quantization.quantize_dynamic(m, {torch.nn.Linear},
                                            dtype=torch.qint8,
                                            inplace=True)
    return m

model.unet = quantize_unet(model.unet)
if hasattr(model, 'controlnet'):
    model.controlnet = quantize_unet(model.controlnet)

Refer to examples/optimize_stable_diffusion_pipeline.py for more details.
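
To gauge the effect on your own setup, one simple check (a sketch, not part of stable-fast) is to compare peak VRAM around a warmed-up call, with and without quantization:

torch.cuda.reset_peak_memory_stats()
_ = model(**kwarg_inputs)  # assumes the warmed-up pipeline from above
print(f'peak VRAM: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB')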

Some Common Methods To Speed Up PyTorch

# TCMalloc is highly suggested to reduce CPU overhead
# https://github.com/google/tcmalloc
LD_PRELOAD=/path/to/libtcmalloc.so python3 ...

# Enable TF32 matmul on supported GPUs (requires torch >= 1.12.0).
import packaging.version
import torch

if packaging.version.parse(torch.__version__) >= packaging.version.parse('1.12.0'):
    torch.backends.cuda.matmul.allow_tf32 = True

Performance Comparison

Performance varies greatly across different hardware/software/platform/driver configurations, so it is very hard to benchmark accurately, and preparing the environment for benchmarking is itself a hard job. I have tested on some platforms, but the results may still be inaccurate. Note that when benchmarking, the progress bar shown by tqdm may be misleading because of the asynchronous nature of CUDA; to solve this, I use CUDA Events to measure iterations per second accurately, as in the sketch below.

stable-fast is expected to work better on newer GPUs and newer CUDA versions. On older GPUs, the performance increase might be limited.
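
A sketch of that measurement approach with torch.cuda.Event (names are illustrative):

import torch

def time_call(fn, warmup=3, iters=10):
    # CUDA events time the GPU itself, avoiding the async-CUDA pitfall
    # that makes wall-clock (and tqdm) numbers unreliable.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters
    return 1000.0 / ms  # iterations per second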

RTX 4080 (512x512, batch size 1, fp16, in WSL2)

This is my personal gaming PC😄. It has a more powerful CPU than those from cloud server providers.

| Framework | SD 1.5 | SD XL (1024x1024) | SD 1.5 ControlNet |
| --- | --- | --- | --- |
| Vanilla PyTorch (2.1.0) | 29.5 it/s | 4.6 it/s | 19.7 it/s |
| torch.compile (2.1.0, max-autotune) | 40.0 it/s | 6.1 it/s | 21.8 it/s |
| AITemplate | 44.2 it/s | | |
| OneFlow | 53.6 it/s | | |
| AUTO1111 WebUI | 17.2 it/s | 3.6 it/s | |
| AUTO1111 WebUI (with SDPA) | 24.5 it/s | 4.3 it/s | |
| TensorRT (AUTO1111 WebUI) | 40.8 it/s | | |
| TensorRT Official Demo | 52.6 it/s | | |
| stable-fast (with xformers & Triton) | 51.6 it/s | 9.1 it/s | 36.7 it/s |

H100

Thanks to @Consceleratus and @harishp for their help, I have tested speed on the H100.

| Framework | SD 1.5 | SD XL (1024x1024) | SD 1.5 ControlNet |
| --- | --- | --- | --- |
| Vanilla PyTorch (2.1.0) | 54.5 it/s | 14.9 it/s | 35.8 it/s |
| torch.compile (2.1.0, max-autotune) | 66.0 it/s | 18.5 it/s | |
| stable-fast (with xformers & Triton) | 104.6 it/s | 21.6 it/s | 72.6 it/s |

A100

Thanks to @SuperSecureHuman and @jon-chuang for their help, benchmarking on the A100 is now available.

| Framework | SD 1.5 | SD XL (1024x1024) | SD 1.5 ControlNet |
| --- | --- | --- | --- |
| Vanilla PyTorch (2.1.0) | 35.6 it/s | 8.7 it/s | 25.1 it/s |
| torch.compile (2.1.0, max-autotune) | 41.9 it/s | 10.0 it/s | |
| stable-fast (with xformers & Triton) | 61.8 it/s | 11.9 it/s | 41.1 it/s |

Compatibility

| Model | Supported |
| --- | --- |
| Hugging Face Diffusers (1.5/2.1/XL) | Yes |
| With ControlNet | Yes |
| With LoRA | Yes |
| Latent Consistency Model | Yes |
| SDXL Turbo | Yes |
| Stable Video Diffusion | Yes |

| Functionality | Supported |
| --- | --- |
| Dynamic Shape | Yes |
| Text to Image | Yes |
| Image to Image | Yes |
| Image Inpainting | Yes |

| UI Framework | Supported | Link |
| --- | --- | --- |
| AUTOMATIC1111 | WIP | |
| SD Next | Yes | SD Next |
| ComfyUI | Yes | ComfyUI_stable_fast |

| Operating System | Supported |
| --- | --- |
| Linux | Yes |
| Windows | Yes |
| Windows WSL | Yes |

Troubleshooting

Refer to doc/troubleshooting.md for more details.

And you can join the Discord Channel to ask for help.


stable-fast's Issues

Trying to get it working on Windows

I've been trying to get it to work on Windows, but I'm currently stuck on what seems to be an import error for a DLL file. Does anyone know how to fix this issue, or have tips for getting it working on Windows?

Traceback (most recent call last):
  File "C:\Users\USER\ComfyUI_windows_portable\ComfyUI\nodes.py", line 1800, in load_custom_node
    module_spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "C:\Users\USER\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_stable_fast\__init__.py", line 1, in <module>    from .node import ApplyStableFastUnet
  File "C:\Users\USER\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_stable_fast\node.py", line 2, in <module>
    from sfast.compilers.stable_diffusion_pipeline_compiler import CompilationConfig
  File "C:\Users\USER\ComfyUI_windows_portable\python_embeded\Lib\site-packages\sfast\__init__.py", line 23, in <module>
    import sfast._C as _C
ImportError: DLL load failed while importing _C: The specified module could not be found.

Cannot import C:\Users\USER\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_stable_fast module for custom nodes: DLL load failed while importing _C: The specified module could not be found.

Invert FORCE_CUDA=1 or improve install from source to fail if cuda extensions can't be built

So I was debugging why the deployed model's inference was slower than in a Jupyter Notebook.
The reason was that the CI that builds the container image did not build the CUDA extensions (because of this).

It would probably be better to invert this logic and throw an error if it cannot build the CUDA extensions.
Otherwise many users might not understand why their performance is worse than it should be.

This could be bypassed with an env var like OPTIONAL_CUDA=1 or DISABLE_CUDA=1.

[composability] stable-fast + sd-turbo device mismatch

2023-12-07 17:35:10.128 [stderr   ]   File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2023-12-07 17:35:10.128 [stderr   ]     return func(*args, **kwargs)
2023-12-07 17:35:10.128 [stderr   ]   File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py", line 926, in __call__
2023-12-07 17:35:10.128 [stderr   ]     image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[
2023-12-07 17:35:10.128 [stderr   ]   File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 36, in dynamic_graphed_callable
2023-12-07 17:35:10.128 [stderr   ]     cached_callable = simple_make_graphed_callable(
2023-12-07 17:35:10.128 [stderr   ]   File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 53, in simple_make_graphed_callable
2023-12-07 17:35:10.128 [stderr   ]     return make_graphed_callable(callable,
2023-12-07 17:35:10.128 [stderr   ]   File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 98, in make_graphed_callable
2023-12-07 17:35:10.129 [stderr   ]     static_inputs = shadow_copy(static_inputs_)
2023-12-07 17:35:10.129 [stderr   ]   File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/utils/copy.py", line 49, in shadow_copy
2023-12-07 17:35:10.129 [stderr   ]     return type(obj)(shadow_copy(x, detach=detach) for x in obj)
2023-12-07 17:35:10.129 [stderr   ]   File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/utils/copy.py", line 49, in <genexpr>
2023-12-07 17:35:10.129 [stderr   ]     return type(obj)(shadow_copy(x, detach=detach) for x in obj)
2023-12-07 17:35:10.129 [stderr   ]   File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/utils/copy.py", line 45, in shadow_copy
2023-12-07 17:35:10.129 [stderr   ]     return sfast._C._create_shadow_tensor(
2023-12-07 17:35:10.129 [stderr   ] RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.

Failing to compile

Using Linux 22.04, Torch 2., diffusers 0.22.1, xformers 0.0.22.post7, triton 2.1
I've posted the entire log of the build hoping it helps isolate what I'm doing wrong.

Using pip 23.3.1 from /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/pip (python 3.10)
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting stable-fast
  Cloning https://github.com/chengzeyi/stable-fast.git (to revision main) to /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273
  Running command git version
  git version 2.34.1
  Running command git clone --filter=blob:none https://github.com/chengzeyi/stable-fast.git /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273
  Cloning into '/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273'...
  Running command git show-ref main
  52afadfe0f49b2aa13e9ac15566cedb4e0732784 refs/heads/main
  52afadfe0f49b2aa13e9ac15566cedb4e0732784 refs/remotes/origin/main
  Running command git symbolic-ref -q HEAD
  refs/heads/main
  Resolved https://github.com/chengzeyi/stable-fast.git to commit 52afadfe0f49b2aa13e9ac15566cedb4e0732784
  Running command git rev-parse HEAD
  52afadfe0f49b2aa13e9ac15566cedb4e0732784
  Running command python setup.py egg_info
  running egg_info
  creating /tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info
  writing /tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/PKG-INFO
  writing dependency_links to /tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/dependency_links.txt
  writing requirements to /tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/requires.txt
  writing top-level names to /tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/top_level.txt
  writing manifest file '/tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/SOURCES.txt'
  reading manifest file '/tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/SOURCES.txt'
  adding license file 'LICENSE'
  writing manifest file '/tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/SOURCES.txt'
  Preparing metadata (setup.py) ... done
Requirement already satisfied: packaging in ./venv/lib/python3.10/site-packages (from stable-fast) (23.2)
Requirement already satisfied: torch>=1.12.0 in ./venv/lib/python3.10/site-packages (from stable-fast) (2.1.0)
Requirement already satisfied: filelock in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (3.13.1)
Requirement already satisfied: typing-extensions in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (4.8.0)
Requirement already satisfied: sympy in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (1.12)
Requirement already satisfied: networkx in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (3.2.1)
Requirement already satisfied: jinja2 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (3.1.2)
Requirement already satisfied: fsspec in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (2023.10.0)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.105)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.105)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.105)
Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (8.9.2.26)
Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.3.1)
Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (11.0.2.54)
Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (10.3.2.106)
Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (11.4.5.107)
Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.0.106)
Requirement already satisfied: nvidia-nccl-cu12==2.18.1 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (2.18.1)
Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.105)
Requirement already satisfied: triton==2.1.0 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (2.1.0)
Requirement already satisfied: nvidia-nvjitlink-cu12 in ./venv/lib/python3.10/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch>=1.12.0->stable-fast) (12.3.52)
Requirement already satisfied: MarkupSafe>=2.0 in ./venv/lib/python3.10/site-packages (from jinja2->torch>=1.12.0->stable-fast) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in ./venv/lib/python3.10/site-packages (from sympy->torch>=1.12.0->stable-fast) (1.3.0)
Building wheels for collected packages: stable-fast
  Running command python setup.py bdist_wheel
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-cpython-310
  creating build/lib.linux-x86_64-cpython-310/sfast
  copying sfast/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast
  creating build/lib.linux-x86_64-cpython-310/sfast/jit
  copying sfast/jit/trace_helper.py -> build/lib.linux-x86_64-cpython-310/sfast/jit
  copying sfast/jit/utils.py -> build/lib.linux-x86_64-cpython-310/sfast/jit
  copying sfast/jit/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/jit
  creating build/lib.linux-x86_64-cpython-310/sfast/compilers
  copying sfast/compilers/stable_diffusion_pipeline_compiler.py -> build/lib.linux-x86_64-cpython-310/sfast/compilers
  copying sfast/compilers/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/compilers
  creating build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/aot_printer.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/patch.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/copy_func.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/custom_python_operator.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/torch_dispatch.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/flat_tensors.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/gpu_device.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/memory_format.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/xformers_attention.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/env.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/compute_precision.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  creating build/lib.linux-x86_64-cpython-310/sfast/triton
  copying sfast/triton/torch_ops.py -> build/lib.linux-x86_64-cpython-310/sfast/triton
  copying sfast/triton/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/triton
  creating build/lib.linux-x86_64-cpython-310/sfast/dynamo
  copying sfast/dynamo/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/dynamo
  creating build/lib.linux-x86_64-cpython-310/sfast/cuda
  copying sfast/cuda/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/cuda
  copying sfast/cuda/graphs.py -> build/lib.linux-x86_64-cpython-310/sfast/cuda
  creating build/lib.linux-x86_64-cpython-310/sfast/jit/passes
  copying sfast/jit/passes/triton_passes.py -> build/lib.linux-x86_64-cpython-310/sfast/jit/passes
  copying sfast/jit/passes/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/jit/passes
  creating build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
  copying sfast/utils/term_image/image_to_ansi.py -> build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
  copying sfast/utils/term_image/imgcat.py -> build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
  copying sfast/utils/term_image/kdtree.py -> build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
  copying sfast/utils/term_image/climage.py -> build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
  copying sfast/utils/term_image/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
  creating build/lib.linux-x86_64-cpython-310/sfast/triton/modules
  copying sfast/triton/modules/patch.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/modules
  copying sfast/triton/modules/native.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/modules
  copying sfast/triton/modules/diffusers.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/modules
  copying sfast/triton/modules/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/modules
  creating build/lib.linux-x86_64-cpython-310/sfast/triton/ops
  copying sfast/triton/ops/group_norm.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/ops
  copying sfast/triton/ops/copy.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/ops
  copying sfast/triton/ops/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/ops
  copying sfast/triton/ops/activation.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/ops
  copying sfast/triton/ops/conv.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/ops
  creating build/lib.linux-x86_64-cpython-310/sfast/dynamo/backends
  copying sfast/dynamo/backends/sfast_jit.py -> build/lib.linux-x86_64-cpython-310/sfast/dynamo/backends
  copying sfast/dynamo/backends/registry.py -> build/lib.linux-x86_64-cpython-310/sfast/dynamo/backends
  copying sfast/dynamo/backends/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/dynamo/backends
  running build_ext
  /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py:414: UserWarning: The detected CUDA version (12.3) has a minor version mismatch with the version that was used to compile PyTorch (12.1). Most likely this shouldn't be a problem.
    warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
  /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py:424: UserWarning: There are no x86_64-linux-gnu-g++ version bounds defined for CUDA version 12.3
    warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
  building 'sfast._C' extension
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn
  Emitting ninja build file /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  [1/13] /usr/local/cuda/bin/nvcc  -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --extended-lambda -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -std=c++17
  FAILED: /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o
  /usr/local/cuda/bin/nvcc  -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --extended-lambda -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -std=c++17
  In file included from /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu:7:
  /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/ATen/cudnn/cudnn-wrapper.h:3:10: fatal error: cudnn.h: No such file or directory
      3 | #include <cudnn.h>
        |          ^~~~~~~~~
  compilation terminated.
  [2/13] /usr/local/cuda/bin/nvcc  -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/CUDABlas.cu -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/CUDABlas.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --extended-lambda -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -std=c++17
  /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/CUDABlas.cu(291): warning #177-D: variable "falpha" was declared but never referenced
      float falpha = alpha;
            ^

  Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

  /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/CUDABlas.cu(292): warning #177-D: variable "fbeta" was declared but never referenced
      float fbeta = beta;
            ^

  [3/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/misc.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/misc.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/misc.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [4/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/scalar_tensor_erase.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/scalar_tensor_erase.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/scalar_tensor_erase.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [5/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/fused_linear.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/fused_linear.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/fused_linear.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [6/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/op_input_tensor_conversion.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/op_input_tensor_conversion.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/op_input_tensor_conversion.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [7/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/device_constant_override.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/device_constant_override.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/device_constant_override.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [8/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/compilation_unit.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/compilation_unit.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/compilation_unit.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [9/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/main.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/main.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/main.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [10/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/python_operator.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/python_operator.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/python_operator.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  In file included from /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/utils/python_arg_parser.h:65,
                   from /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/jit/python/pybind_utils.h:26,
                   from /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/python_operator.cpp:10:
  /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/utils/python_strings.h:104:19: warning: ‘pybind11::object PyObject_FastGetAttrString(PyObject*, const char*)’ defined but not used [-Wunused-function]
    104 | static py::object PyObject_FastGetAttrString(PyObject* obj, const char* name) {
        |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~
  [11/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [12/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/init.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/init.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/init.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [13/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/cublas_gemm.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/cublas_gemm.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/cublas_gemm.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/cublas_gemm.cpp:641:55: warning: "/*" within comment [-Wcomment]
    641 |               beta_, self.scalar_type(), c10::nullopt /* layout */
        |
  /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/cublas_gemm.cpp:642:28: warning: "/*" within comment [-Wcomment]
    642 | /*, at::kCPU, c10::nullopt /* pin_memory */ /*));
        |
  ninja: build stopped: subcommand failed.
  Traceback (most recent call last):
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
      subprocess.run(
    File "/usr/lib/python3.10/subprocess.py", line 526, in run
      raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/setup.py", line 109, in <module>
      setup(
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/__init__.py", line 103, in setup
      return distutils.core.setup(**attrs)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
      return run_commands(dist)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
      dist.run_commands()
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
      self.run_command(cmd)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/dist.py", line 989, in run_command
      super().run_command(command)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/wheel/bdist_wheel.py", line 369, in run
      self.run_command("build")
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/dist.py", line 989, in run_command
      super().run_command(command)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/command/build.py", line 131, in run
      self.run_command(cmd_name)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/dist.py", line 989, in run_command
      super().run_command(command)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 88, in run
      _build_ext.run(self)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
      self.build_extensions()
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 873, in build_extensions
      build_ext.build_extensions(self)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
      self._build_extensions_serial()
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
      self.build_extension(ext)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 249, in build_extension
      _build_ext.build_extension(self, ext)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/Cython/Distutils/build_ext.py", line 135, in build_extension
      super(build_ext, self).build_extension(ext)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
      objects = self.compiler.compile(
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 686, in unix_wrap_ninja_compile
      _write_ninja_file_and_compile_objects(
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1774, in _write_ninja_file_and_compile_objects
      _run_ninja_build(
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
      raise RuntimeError(message) from e
  RuntimeError: Error compiling objects for extension
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/bin/python3 -u -c '
  exec(compile('"'"''"'"''"'"'
  # This is <pip-setuptools-caller> -- a caller that pip uses to run setup.py
  #
  # - It imports setuptools before invoking setup.py, to enable projects that directly
  #   import from `distutils.core` to work with newer packaging standards.
  # - It provides a clear error message when setuptools is not installed.
  # - It sets `sys.argv[0]` to the underlying `setup.py`, when invoking `setup.py` so
  #   setuptools doesn'"'"'t think the script is `-c`. This avoids the following warning:
  #     manifest_maker: standard file '"'"'-c'"'"' not found".
  # - It generates a shim setup.py, for handling setup.cfg-only projects.
  import os, sys, tokenize
  
  try:
      import setuptools
  except ImportError as error:
      print(
          "ERROR: Can not execute `setup.py` since setuptools is not available in "
          "the build environment.",
          file=sys.stderr,
      )
      sys.exit(1)
  
  __file__ = %r
  sys.argv[0] = __file__
  
  if os.path.exists(__file__):
      filename = __file__
      with tokenize.open(__file__) as f:
          setup_py_code = f.read()
  else:
      filename = "<auto-generated setuptools caller>"
      setup_py_code = "from setuptools import setup; setup()"
  
  exec(compile(setup_py_code, filename, "exec"))
  '"'"''"'"''"'"' % ('"'"'/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/setup.py'"'"',), "<pip-setuptools-caller>", "exec"))' bdist_wheel -d /tmp/pip-wheel-_zs0b_wo
  cwd: /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/
  Building wheel for stable-fast (setup.py) ... error
  ERROR: Failed building wheel for stable-fast
  Running setup.py clean for stable-fast
  Running command python setup.py clean
  running clean
  removing 'build/temp.linux-x86_64-cpython-310' (and everything under it)
  removing 'build/lib.linux-x86_64-cpython-310' (and everything under it)
  'build/bdist.linux-x86_64' does not exist -- can't clean it
  'build/scripts-3.10' does not exist -- can't clean it
  removing 'build'
Failed to build stable-fast
ERROR: Could not build wheels for stable-fast, which is required to install pyproject.toml-based projects

Can't get it running... can anyone help please? RTX 3090

Hi, I am trying to run python3 optimize_stable_diffusion_pipeline.py, and I get this nasty error where I can't really tell what exactly is wrong, as it refers to just about everything included here.

I am pretty sure I have the correct version of stable-fast, the one corresponding to my Python 3.10, CUDA 12.1 and torch 2.1; gcc is 11.4.
My system is an RTX 3090, i7, etc., on a quite clean install of Ubuntu. Nvidia drivers are 530.

this is my pip3 list:

pip3 list
Package                  Version
------------------------ ------------
accelerate 0.25.0
antlr4-python3-runtime 4.9.3
apturl 0.5.2
bcrypt 3.2.0
blinker 1.4
Brlapi 0.8.3
certifi 2020.6.20
chardet 4.0.0
click 8.0.3
colorama 0.4.4
command-not-found 0.3
cryptography 3.4.8
cupshelpers 1.0
dbus-python 1.2.18
defer 1.0.6
diffusers 0.24.0
distro 1.7.0
distro-info 1.1+ubuntu0.1
duplicity 0.8.21
fasteners 0.14.1
filelock 3.13.1
fsspec 2023.12.1
future 0.18.2
httplib2 0.20.2
huggingface-hub 0.19.4
idna 3.3
importlib-metadata 4.6.4
jeepney 0.7.1
Jinja2 3.1.2
keyring 23.5.0
language-selector 0.1
launchpadlib 1.10.16
lazr.restfulclient 0.14.4
lazr.uri 1.0.6
lockfile 0.12.2
louis 3.20.0
macaroonbakery 1.3.1
Mako 1.1.3
MarkupSafe 2.0.1
monotonic 1.6
more-itertools 8.10.0
mpmath 1.3.0
netifaces 0.11.0
networkx 3.2.1
numpy 1.26.2
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
oauthlib 3.2.0
olefile 0.46
omegaconf 2.3.0
packaging 23.2
paramiko 2.9.3
pexpect 4.8.0
Pillow 9.0.1
pip 22.0.2
protobuf 3.12.4
psutil 5.9.6
ptyprocess 0.7.0
pycairo 1.20.1
pycups 2.0.1
PyGObject 3.42.1
PyJWT 2.3.0
pymacaroons 0.13.0
PyNaCl 1.5.0
pyparsing 2.4.7
PyQt5 5.15.10
PyQt5-Qt5 5.15.2
PyQt5-sip 12.13.0
pyRFC3339 1.1
python-apt 2.4.0+ubuntu2
python-dateutil 2.8.1
python-debian 0.1.43+ubuntu1.1
pytz 2022.1
pyxdg 0.27
PyYAML 5.4.1
regex 2023.10.3
reportlab 3.6.8
requests 2.25.1
safetensors 0.4.1
screen-resolution-extra 0.0.0
SecretStorage 3.3.1
setuptools 59.6.0
six 1.16.0
ssh-import-id 5.11
stable-fast 0.0.13.post3+torch210cu121
sympy 1.12
systemd-python 234
tokenizers 0.15.0
torch 2.1.0
torchvision 0.16.0
tqdm 4.66.1
transformers 4.35.2
triton 2.1.0
typing_extensions 4.8.0
ubuntu-advantage-tools 8001
ubuntu-drivers-common 0.0.0
ufw 0.36.1
unattended-upgrades 0.1
urllib3 1.26.5
usb-creator 0.3.7
wadllib 1.3.6
wheel 0.37.1
xdg 5
xformers 0.0.22.post7
xkit 0.0.0
zipp 1.0.0

and error is below:

Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00, 6.96it/s]
/home/sd/.local/lib/python3.10/site-packages/torch/cuda/graphs.py:88: UserWarning: The CUDA Graph is empty. This usually means that the graph was attempted to be captured on wrong device or stream. (Triggered internally at ../aten/src/ATen/cuda/CUDAGraph.cpp:192.)
super().capture_end()
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:159: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
obj_type = tensors[start].item()
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:218: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:228: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:214: TracerWarning: Converting a tensor to a Python list might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bytes(tensors[start].tolist()), start + 1
/home/sd/.local/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:66: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if input_shape[-1] > 1 or self.sliding_window is not None:
/home/sd/.local/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:137: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if past_key_values_length > 0:
/home/sd/.local/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:273: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/home/sd/.local/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:281: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
/home/sd/.local/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:313: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:23: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return torch.tensor([num], dtype=torch.int64)
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:253: TracerWarning: torch.Tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return super().new(cls, x, *args, **kwargs)
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:123: TracerWarning: torch.as_tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return (torch.as_tensor(tuple(obj), dtype=torch.uint8), )
0%| | 0/30 [00:00<?, ?it/s]/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:197: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bool(tensors[start].item()), start + 1
/home/sd/.local/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py:878: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if dim % default_overall_up_factor != 0:
/home/sd/.local/lib/python3.10/site-packages/diffusers/models/resnet.py:265: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/home/sd/.local/lib/python3.10/site-packages/diffusers/models/resnet.py:271: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/home/sd/.local/lib/python3.10/site-packages/diffusers/models/resnet.py:173: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/home/sd/.local/lib/python3.10/site-packages/diffusers/models/resnet.py:186: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if hidden_states.shape[0] >= 64:
/usr/bin/ld: skipping incompatible /lib/i386-linux-gnu/libcuda.so when searching for -lcuda
/usr/bin/ld: skipping incompatible /lib/i386-linux-gnu/libcuda.so when searching for -lcuda
/usr/bin/ld: cannot find -lcuda: No such file or directory
/usr/bin/ld: skipping incompatible /lib/i386-linux-gnu/libcuda.so when searching for -lcuda
/usr/bin/ld: skipping incompatible /lib/i386-linux-gnu/libcuda.so when searching for -lcuda
collect2: error: ld returned 1 exit status
0%| | 0/30 [00:03<?, ?it/s]
Traceback (most recent call last):
File "/home/sd/Playground/stable-fast/examples/optimize_stable_diffusion_pipeline.py", line 150, in
main()
File "/home/sd/Playground/stable-fast/examples/optimize_stable_diffusion_pipeline.py", line 132, in main
model(**get_kwarg_inputs())
File "/home/sd/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 918, in call
noise_pred = self.unet(
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 29, in dynamic_graphed_callable
cached_callable = simple_make_graphed_callable(
File "/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 46, in simple_make_graphed_callable
return make_graphed_callable(callable,
File "/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 75, in make_graphed_callable
callable(*tree_copy(example_inputs),
File "/home/sd/.local/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 62, in wrapper
return traced_module(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 119, in forward
outputs = self.module(*self.convert_inputs(args, kwargs))
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):

graph(%input, %num_groups, %weight, %bias, %eps, %cudnn_enabled):
%y : Tensor = sfast_triton::group_norm_silu(%input, %num_groups, %weight, %bias, %eps)
~~~~~~~~~~~~ <--- HERE
return (%y)
RuntimeError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpy9k09g46/main.c', '-O3', '-I/home/sd/.local/lib/python3.10/site-packages/triton/common/../third_party/cuda/include', '-I/usr/include/python3.10', '-I/tmp/tmpy9k09g46', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpy9k09g46/group_norm_4d_channels_last_forward_collect_stats_kernel.cpython-310-x86_64-linux-gnu.so', '-L/lib/x86_64-linux-gnu', '-L/lib/i386-linux-gnu', '-L/lib/i386-linux-gnu']' returned non-zero exit status 1.

At:
/usr/lib/python3.10/subprocess.py(369): check_call
/home/sd/.local/lib/python3.10/site-packages/triton/common/build.py(90): _build
/home/sd/.local/lib/python3.10/site-packages/triton/compiler/make_launcher.py(39): make_stub
/home/sd/.local/lib/python3.10/site-packages/triton/compiler/compiler.py(425): compile
<string>(63): group_norm_4d_channels_last_forward_collect_stats_kernel
/home/sd/.local/lib/python3.10/site-packages/sfast/triton/__init__.py(35): new_func
/home/sd/.local/lib/python3.10/site-packages/triton/runtime/autotuner.py(232): run
/home/sd/.local/lib/python3.10/site-packages/triton/runtime/autotuner.py(232): run
/home/sd/.local/lib/python3.10/site-packages/sfast/triton/ops/group_norm.py(425): group_norm_forward
/home/sd/.local/lib/python3.10/site-packages/sfast/triton/torch_ops.py(188): forward
/home/sd/.local/lib/python3.10/site-packages/torch/autograd/function.py(539): apply
/home/sd/.local/lib/python3.10/site-packages/sfast/triton/torch_ops.py(226): group_norm_silu
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1527): _call_impl
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/home/sd/.local/lib/python3.10/site-packages/sfast/jit/trace_helper.py(119): forward
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1527): _call_impl
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/home/sd/.local/lib/python3.10/site-packages/sfast/jit/trace_helper.py(62): wrapper
/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py(75): make_graphed_callable
/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py(46): simple_make_graphed_callable
/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py(29): dynamic_graphed_callable
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1527): _call_impl
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/home/sd/.local/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py(918): __call__
/home/sd/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py(115): decorate_context
/home/sd/Playground/stable-fast/examples/optimize_stable_diffusion_pipeline.py(132): main
/home/sd/Playground/stable-fast/examples/optimize_stable_diffusion_pipeline.py(150): <module>

Can anyone at least guide me as to what could be wrong? Thanks.
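
Edit: staring at the log some more, the actual failure seems to be Triton's JIT link step: it compiles a small launcher stub with gcc ... -lcuda, and the linker only finds an incompatible 32-bit /lib/i386-linux-gnu/libcuda.so. If that is the culprit (just my guess, not a confirmed diagnosis), this quick check should fail in a related way:

import ctypes
import ctypes.util

# Assumption: Triton links its launcher with `gcc ... -lcuda`; if even the runtime
# loader cannot find a 64-bit driver libcuda, that link step has no chance either.
print("libcuda resolved to:", ctypes.util.find_library("cuda"))
ctypes.CDLL("libcuda.so.1")  # raises OSError if no loadable 64-bit libcuda is found
print("64-bit libcuda loaded OK")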

Update torch.compile benchmark on A100 40GB SDv1.5 for torch nightly

100%|██████████| 50/50 [00:00<00:00, 58.00it/s]

58 Iterations per second. (with torch.cuda.synchronize)


Settings:

A100 40GB, fp16, batch size 1
height=512,
width=512,
num_inference_steps=50,
num_images_per_prompt=1,

Wall clock time for full pipeline: 862ms. (no torch.cuda.synchronize). 927ms (with torch.cuda.synchronize)

torch version: torch==2.2.0.dev20231203+cu121

Also, please update wallclock time and iteration/s for other GPUs on torch nightly. If extrapolation serves us right, it ought to be faster than TensorRT.


Outdated
2023-12-04 03:24:33.549 [stderr   ] 100%|██████████| 50/50 [00:00<00:00, 79.00it/s]

79 Iterations per second. (naive measurement)

Wall clock time for full pipeline: 862ms.


torch eager and not channels first: wallclock time: 1763ms

100%|██████████| 50/50 [00:01<00:00, 29.55it/s]

(naive measurement)

RuntimeError: _Map_base::at

Hi, I was trying this out for maximum optimization on an AWS G5 instance running Ubuntu (it's just an NVIDIA A10G). I was using ComfyUI by calling the nodes themselves in Python code, and I kept getting this error message that I couldn't solve.
How would I be able to resolve this?
All the dependencies regarding pytorch and 'diffusers>=0.19.3' 'xformers>=0.0.20' 'triton>=2.1.0' 'torch>=1.12.0' were met, and it works on my desktop but not on Ubuntu.

/opt/conda/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:157: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
obj_type = tensors[start].item()
/opt/conda/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:216: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/opt/conda/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:226: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/opt/conda/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:212: TracerWarning: Converting a tensor to a Python list might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bytes(tensors[start].tolist()), start + 1
/opt/conda/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:203: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return int(tensors[start].item()), start + 1
0%| | 0/12 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/ubuntu/test/workflow_clip_sdxl2.py", line 320, in
main()
File "/home/ubuntu/test/workflow_clip_sdxl2.py", line 229, in main
ksampler_3 = ksampler.sample(
File "/home/ubuntu/ComfyUI/nodes.py", line 1286, in sample
return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
File "/home/ubuntu/ComfyUI/nodes.py", line 1256, in common_ksampler
samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
File "/home/ubuntu/ComfyUI/custom_nodes/ComfyUI-Impact-Pack/modules/impact/sample_error_enhancer.py", line 22, in informative_sample
raise e
File "/home/ubuntu/ComfyUI/custom_nodes/ComfyUI-Impact-Pack/modules/impact/sample_error_enhancer.py", line 9, in informative_sample
return original_sample(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/sample.py", line 100, in sample
samples = sampler.sample(noise, positive_copy, negative_copy, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 711, in sample
return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 617, in sample
samples = sampler.sample(model_wrap, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 556, in sample
samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/k_diffusion/sampling.py", line 137, in sample_euler
denoised = model(x, sigma_hat * s_in, **extra_args)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 277, in forward
out = self.inner_model(x, sigma, cond=cond, uncond=uncond, cond_scale=cond_scale, model_options=model_options, seed=seed)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 267, in forward
return self.apply_model(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 264, in apply_model
out = sampling_function(self.inner_model, x, timestep, uncond, cond, cond_scale, model_options=model_options, seed=seed)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 252, in sampling_function
cond, uncond = calc_cond_uncond_batch(model, cond, uncond, x, timestep, model_options)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 228, in calc_cond_uncond_batch
output = model_options['model_function_wrapper'](model.apply_model, {"input": input_x, "timestep": timestep_, "c": c, "cond_or_uncond": cond_or_uncond}).chunk(batch_chunks)
File "/home/ubuntu/ComfyUI/custom_nodes/ComfyUI_stable_fast/node.py", line 69, in call
return self.stable_fast_model.get_traced_module(input_x, timestep
, **c)[0](
File "/home/ubuntu/ComfyUI/custom_nodes/ComfyUI_stable_fast/module/stable_diffusion_pipeline_compiler.py", line 62, in get_traced_module
traced_m, call_helper = trace_with_kwargs(
File "/opt/conda/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 23, in trace_with_kwargs
traced_module = better_trace(TraceablePosArgOnlyModuleWrapper(func),
File "/opt/conda/lib/python3.10/site-packages/sfast/jit/utils.py", line 29, in better_trace
script_module = torch.jit.trace(func, *args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/jit/_trace.py", line 798, in trace
return trace_module(
File "/opt/conda/lib/python3.10/site-packages/torch/jit/_trace.py", line 1065, in trace_module
module._c._create_method_from_trace(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1508, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 127, in forward
outputs = self.module(*orig_args, **orig_kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1508, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 77, in forward
return self.func(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/model_base.py", line 68, in apply_model
model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1508, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/ldm/modules/diffusionmodules/openaimodel.py", line 619, in forward
h = forward_timestep_embed(module, h, emb, context, transformer_options)
File "/home/ubuntu/ComfyUI/comfy/ldm/modules/diffusionmodules/openaimodel.py", line 35, in forward_timestep_embed
x = layer(x, emb)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1508, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/ldm/modules/diffusionmodules/openaimodel.py", line 210, in forward
return checkpoint(
File "/home/ubuntu/ComfyUI/comfy/ldm/modules/diffusionmodules/util.py", line 121, in checkpoint
return CheckpointFunction.apply(func, len(inputs), *args)
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
RuntimeError: _Map_base::at

ModuleNotFoundError: No module named 'sfast._C'

Traceback (most recent call last):
File "/workspace/webui-pakage/stable-fast-0.0.2/predict.py", line 3, in
from sfast.compilers.stable_diffusion_pipeline_compiler import (compile,
File "/workspace/webui-pakage/stable-fast-0.0.2/sfast/init.py", line 22, in
import sfast._C as _C
ModuleNotFoundError: No module named 'sfast._C'

How can I solve this?

env:
A100
xformers 0.0.22.post4+cu118
triton 2.1.0
torch 2.1.0+cu118
transformers 4.34.1
ninja 1.11.1.1
diffusers 0.21.4
stable-fast 0.0.2
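
One thing I noticed while debugging (the paths in the traceback show predict.py running from inside the stable-fast-0.0.2 source tree, so this is my guess, not a confirmed cause): Python may be importing the local, unbuilt sfast package instead of an installed wheel. A quick check:

import sfast

# If this prints a path inside the source checkout (e.g. .../stable-fast-0.0.2/sfast/__init__.py)
# instead of site-packages, Python is picking up the unbuilt source tree, which has no compiled
# `_C` extension module; running from another directory (or installing a prebuilt wheel
# matching the torch/CUDA versions) would avoid the shadowing.
print(sfast.__file__)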

'torch' module not found during 'stable-fast' installation

Environment Information:

  • Python Version: 3.10.13
  • Virtual Environment: venv (confirmed)
  • torch Version: 2.1.0 (installed in the venv)
  • pip Version: 23.3.1
  • OS: Linux

Problem Description:
When trying to install the stable-fast package, I encounter a ModuleNotFoundError: No module named 'torch'. I have confirmed that torch is installed in the virtual environment using pip list. Also, no package conflicts were detected with pip check.

Attempted Solutions:

  • Reconfirmed the installation of torch in the virtual environment.
  • Tested in a different virtual environment.

Additional Information:
The same error occurs when installing directly from the GitHub repository and from PyPI.

I would appreciate any suggestions or solutions to this problem.
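
One thing I have not tried yet, in case it is relevant (an assumption on my part, though the error message is consistent with it): pip builds source packages in an isolated environment by default, so setup.py can fail to import torch even though torch is installed in the venv. Installing with build isolation disabled, e.g. pip install stable-fast --no-build-isolation, or installing a prebuilt wheel matching my torch/CUDA versions from the project's releases page, should rule that out.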

Tiny AutoEncoder support

If we're going for raw speed, autoencoder_tiny should present a massive speedup: https://huggingface.co/docs/diffusers/api/models/autoencoder_tiny
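
Roughly, what this means in diffusers terms is just swapping the pipeline's VAE before compiling; a minimal sketch, assuming the madebyollin/taesd checkpoint (illustration only, not code from this repo):

import torch
from diffusers import AutoencoderTiny, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# Replace the full VAE with the Tiny AutoEncoder before handing the pipeline to stable-fast.
pipe.vae = AutoencoderTiny.from_pretrained(
    "madebyollin/taesd", torch_dtype=torch.float16
).to("cuda")

Notably, AutoencoderTiny has no quant_conv module, which would be consistent with the line below being the thing that trips up compilation.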

I'm not sure exactly how this works but it seems like the issue is here:

m.vae.quant_conv.forward = lazy_trace_(m.vae.quant_conv.forward)

Commenting this line out allows my models to compile with TAE and provides approximately a 50ms speedup on my RTX 3090

Found an unsupported argument type in the JIT tracer: at::Generator.

Hi, I'm running SD 2.1 and got the following error. How can I fix it?

  File "/zhangjun/git/stable-fast/src/sfast/jit/trace_helper.py", line 78, in forward
    return self.func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/diffusers/schedulers/scheduling_euler_ancestral_discrete.py", line 356, in step
    noise = randn_tensor(model_output.shape, dtype=model_output.dtype, device=device, generator=generator)
  File "/opt/conda/lib/python3.10/site-packages/diffusers/utils/torch_utils.py", line 80, in randn_tensor
    latents = torch.randn(shape, generator=generator, device=rand_device, dtype=dtype, layout=layout).to(device)
RuntimeError: Found an unsupported argument type in the JIT tracer: at::Generator. File a bug report.

Config options are as follows.

  config = CompilationConfig.Default()
  try:
      import xformers
      config.enable_xformers = True
  except ImportError:
      print('xformers not installed, skip')
  try:
      import triton
      config.enable_triton = True
      torch.backends.cuda.matmul.allow_tf32 = True
  except ImportError:
      print('Triton not installed, skip')
  config.enable_cuda_graph = True
  config.trace_scheduler = True
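
A workaround I'm considering (my assumption: the at::Generator reaches the tracer through the traced scheduler, since step() calls randn_tensor with the generator, exactly as in the traceback above) is to leave the scheduler untraced:

  config.trace_scheduler = False  # keep the scheduler eager so the Generator never enters the JIT trace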

Inference code:

scheduler = EulerAncestralDiscreteScheduler.from_pretrained(
    model_path, subfolder="scheduler"
)
pipeline = StableDiffusionPipeline.from_pretrained(
    model_path,
    scheduler=scheduler,
    torch_dtype=self.dtype,
    safety_checker=None,
)

# compile 
...
pipeline = compile(pipeline, config)

sfast_inputs = dict(
    prompt=prompt,
    negative_prompt=neg_prompt,
    generator=torch.Generator(device='cuda').manual_seed(seed),
    width=width,
    height=height,
    num_inference_steps=num_inference_steps,
)
image = pipeline(**sfast_inputs).images

Environment Information:

  • diffusers==0.23.1
  • transformers==4.35.2
  • peft==0.6.2

OpenAI Triton conv kernel can't run correctly

I'm running the Triton version of conv from the repository under Triton version 2.1.0, and single-step debugging revealed that the problem seems to be that I can't launch the conv kernel. As soon as I run the portion of the program that launches the kernel, it's automatically killed. What is the problem?

[Enhancement] Contribution to A100 numbers

Hey!

I have access to A100s, and would like to submit the results from A100. Let me know if you have some sort of benchmark script up and ready so that I can use it to run and get the numbers.

Black image with releases post1-post3 and nightly

Releases after 0.0.12 (i.e., 0.0.12.post1 through 0.0.12.post3, and the nightly) generate black images with the same dependencies as 0.0.12.
Using the NV CUDA 12.1 container as base:
requirements.txt
diffusers==0.23.1
xformers
torch==2.1.0
nvidia-pytriton==0.4.1
numpy==1.26.2
triton==2.1.0
requests
transformers==4.35.2
tokenizers==0.15.0
https://github.com/chengzeyi/stable-fast/releases/download/v0.0.12/stable_fast-0.0.12_torch210_cu121-cp310-cp310-manylinux2014_x86_64.whl

Breakdown of individual optimizations

First of all, congrats on the release of such an amazing project!! Reading through the list of individual optimizations / 'features', I am super curious how much each of them contributes to the total performance increase, and whether there is a reasonably smaller set of operations that could yield 85-90% of the whole perf while greatly reducing the surface area. While developing stable-fast, have any numbers been collected between new optimization passes, even unconfirmed ones?
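
For concreteness, here is the kind of crude ablation I have in mind: toggle the public CompilationConfig flags that appear in other issues here one at a time, each on a fresh pipeline, and time it (a rough sketch under those assumptions, not a polished benchmark script):

import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.stable_diffusion_pipeline_compiler import compile, CompilationConfig

def timed_run(overrides):
    # A fresh pipeline per configuration, so earlier compilations don't leak into the next run.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    config = CompilationConfig.Default()
    for name, value in overrides.items():
        setattr(config, name, value)
    pipe = compile(pipe, config)
    pipe(prompt="a photo", num_inference_steps=10)  # warm-up: tracing/compilation happens here
    start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
    start.record()
    pipe(prompt="a photo", num_inference_steps=10)
    end.record()
    torch.cuda.synchronize()
    print(overrides, f"{start.elapsed_time(end):.0f} ms")

for overrides in ({}, {"enable_triton": False}, {"enable_cuda_graph": False},
                  {"enable_xformers": False}):
    timed_run(overrides)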

Optimizing Controlnets independently?

First, awesome work. The speed improvement is great, and the additional cold start time is very manageable.

Second, I load many pipelines for each model, and do things like swap out controlnets and vaes in response to user requests. When I swap controlnets and run a generation, I am getting the error RuntimeError: mat1 and mat2 must have the same dtype, but got Half and Float. I'm guessing this is because the replacement controlnet hasn't had the same optimizations as the original. Is there a way to compile just the controlnets so that they can be swapped in and out of the pipeline dynamically?
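
One thing I should double-check on my own side first, since the Half-vs-Float message hints at it: whether the swap-in ControlNet was loaded in the pipeline's dtype at all. Something like (the canny checkpoint here is purely an example):

import torch
from diffusers import ControlNetModel

# Load the replacement ControlNet in fp16 to match the compiled fp16 pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
).to("cuda")

But even with matching dtypes, the question stands: can the controlnets be compiled independently so they stay swappable?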

Does xformers still matter?

Hi, since you have done extensive benchmarking, you probably have details on this. It seems that you have built-in support for xformers, but I thought PyTorch 2.0 introduced an equivalent built-in mechanism? Should I still care about xformers? Cheers
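
(For reference, the built-in mechanism I mean is torch.nn.functional.scaled_dot_product_attention, added in PyTorch 2.0, which dispatches to fused kernels much like xformers' memory-efficient attention; a minimal illustration, unrelated to this repo's internals:)

import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) in fp16 on CUDA, so a fused backend can be selected.
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v)  # FlashAttention / mem-efficient / math backend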

RuntimeError: no valid convolution algorithms available in CuDNN

The following problem occurred when I called compile(pipe):


Traceback (most recent call last):
  File "svd_sf.py", line 49, in <module>
    frames = pipe(image, decode_chunk_size=7, num_frames=20).frames[0]
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/diffusers/pipelines/stable_video_diffusion/pipeline_stable_video_diffusion.py", line 499, in __call__
    noise_pred = self.unet(
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/cuda/graphs.py", line 40, in dynamic_graphed_callable
    cached_callable = simple_make_graphed_callable(
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/cuda/graphs.py", line 61, in simple_make_graphed_callable
    return make_graphed_callable(func,
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/cuda/graphs.py", line 90, in make_graphed_callable
    func(*tree_copy(example_inputs, detach=True),
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/jit/trace_helper.py", line 64, in wrapper
    return traced_module(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/jit/trace_helper.py", line 133, in forward
    outputs = self.module(*self.convert_inputs(args, kwargs))
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):

graph(%1, %2, %3, %4, %5, %6, %7, %8, %9, %10, %11, %12, %13, %14, %15):
%x = sfast::cudnn_convolution_bias_add(%1, %2, %3, %14, %15, %4, %5, %6, %7, %8, %9)
     ~~~~~ <--- HERE
return (%x)

RuntimeError: no valid convolution algorithms available in CuDNN


I have cudnn installed on my server. torch.backends.cudnn.is_available() and torch.backends.cudnn.enabled show True


Updated: I successfully ran your example in README.md. Currently I'm trying to accelerate Stable Video Diffusion, which involves very large matmuls. Could that be the reason?
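
If it is effectively memory pressure inside cuDNN's algorithm search (just a guess on my part; this error often shows up when no algorithm fits in the available workspace), a cheap test would be shrinking my own call from the traceback:

# Hypothetical reduction: fewer frames and a smaller decode chunk lower peak memory,
# which may let cuDNN find a workable convolution algorithm.
frames = pipe(image, decode_chunk_size=2, num_frames=14).frames[0]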

Compatible with ComfyUI?

Hi! I saw your message about your model. What is your model based on? Do you have a version compatible with ComfyUI, or with Draw Things on Mac? Can it run on Mac MPS?

Is this package compatible with Lycoris LoRAs?

Hello, I see in the README that stable-fast supports LoRAs out of the box. Does anyone know if it's possible to use Lycoris LoRAs with this package? The NVIDIA TensorRT extension for A1111 seems to have questionable support for them.

Running Speed is Slower for SDXL Model

Hi, another issue I found is that it's not accelerating SDXL. Running the demo on an A100, the compiled SDXL model runs at 5.3 it/s, but plain diffusers runs at 8.8 it/s. The compiled one with stable-fast is slower.

exception when using deterministic generation with enable_cuda_graph = True

When doing deterministic generation by passing a torch.Generator to the pipeline, stable-fast raises an AssertionError:

Traceback (most recent call last):
  File "/data/dbgsfast/dbgsfast/__init__.py", line 43, in <module>
    model(**kwarg_inputs, generator=torch.Generator(device="cuda").manual_seed(41))
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 958, in __call__
    image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 32, in dynamic_graphed_callable
    return cached_callable(*args, **kwargs)
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 143, in functionalized
    return _graphed_module(*user_args, **user_kwarg_args)
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 130, in forward
    outputs = self._forward(*inputs, **kwarg_inputs)
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 136, in _forward
    tree_copy_(static_kwarg_inputs, kwarg_inputs)
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/utils/copy.py", line 20, in tree_copy_
    tree_copy_(dest[k], src[k])
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/utils/copy.py", line 22, in tree_copy_
    assert dest == src

Code to reproduce:

import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler
from sfast.compilers.stable_diffusion_pipeline_compiler import (
    compile,
    CompilationConfig,
)


def load_model():
    model = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )

    model.scheduler = EulerAncestralDiscreteScheduler.from_config(
        model.scheduler.config
    )
    model.safety_checker = None
    model.to(torch.device("cuda"))
    return model


model = load_model()

config = CompilationConfig.Default()
config.enable_cuda_graph = True

model = compile(model, config)

kwarg_inputs = dict(
    prompt="(masterpiece:1,2), best quality, masterpiece, best detail face, a beautiful girl",
    num_inference_steps=30,
    num_images_per_prompt=1,
    height=512,
    width=512,
)

model(**kwarg_inputs, generator=torch.Generator(device="cuda").manual_seed(42))

model(**kwarg_inputs, generator=torch.Generator(device="cuda").manual_seed(41))

Fails to compile model in Docker, missing installs?

Hello, I was able to get this repo working with SDXL Turbo on my 4090 using a venv, but when I dockerized the build it repeatedly fails on a missing tmp file. I am wondering if you have seen this issue before, and whether some apt or pip installs missing from my Docker image are causing it. Appreciate your help @chengzeyi.

Dockerfile:
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get -y update \
    && apt-get install -y --no-install-recommends python3.10 python-is-python3 git libgl1 libsndfile1 pip ffmpeg google-perftools \
       libvulkan1 libnvidia-gl-525-server mesa-vulkan-drivers gcc build-essential \
    && apt-get autoremove -y \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --upgrade pip setuptools wheel --no-cache-dir

WORKDIR testing
COPY requirements2.txt requirements2.txt
RUN pip install -r requirements2.txt --no-cache-dir

COPY . .

CMD ["python3", "examples/optimize_stable_diffusion_pipeline.py"]

Error logs:

INFO:root:Tracing forward
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:159: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
obj_type = tensors[start].item()
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:218: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:228: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:214: TracerWarning: Converting a tensor to a Python list might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bytes(tensors[start].tolist()), start + 1
/usr/local/lib/python3.10/dist-packages/transformers/modeling_attn_mask_utils.py:66: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if input_shape[-1] > 1 or self.sliding_window is not None:
/usr/local/lib/python3.10/dist-packages/transformers/modeling_attn_mask_utils.py:137: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if past_key_values_length > 0:
/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py:273: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py:281: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py:313: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:23: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return torch.tensor([num], dtype=torch.int64)
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:253: TracerWarning: torch.Tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return super().new(cls, x, *args, **kwargs)
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:123: TracerWarning: torch.as_tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return (torch.as_tensor(tuple(obj), dtype=torch.uint8), )
0%| | 0/30 [00:00<?, ?it/s]INFO:root:Dynamically graphing forward
/usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py:88: UserWarning: The CUDA Graph is empty. This usually means that the graph was attempted to be captured on wrong device or stream. (Triggered internally at ../aten/src/ATen/cuda/CUDAGraph.cpp:192.)
super().capture_end()
INFO:root:Tracing forward
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:197: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bool(tensors[start].item()), start + 1
/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_condition.py:878: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if dim % default_overall_up_factor != 0:
/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py:265: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py:271: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py:173: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py:186: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if hidden_states.shape[0] >= 64:
/tmp/tmpuumtrxyy/main.c:4:10: fatal error: Python.h: No such file or directory
4 | #include <Python.h>
| ^~~~~~~~~~
compilation terminated.
0%| | 0/30 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/testing/examples/optimize_stable_diffusion_pipeline.py", line 81, in
output_image = model(**kwarg_inputs).images[0]
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 918, in call
noise_pred = self.unet(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py", line 29, in dynamic_graphed_callable
cached_callable = simple_make_graphed_callable(
File "/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py", line 46, in simple_make_graphed_callable
return make_graphed_callable(callable,
File "/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py", line 75, in make_graphed_callable
callable(*tree_copy(example_inputs),
File "/usr/local/lib/python3.10/dist-packages/sfast/jit/trace_helper.py", line 55, in wrapper
return traced_module(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sfast/jit/trace_helper.py", line 112, in forward
outputs = self.module(*self.convert_inputs(*args, **kwargs))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):

graph(%input, %num_groups, %weight, %bias, %eps, %cudnn_enabled):
    %y : Tensor = sfast_triton::group_norm_silu(%input, %num_groups, %weight, %bias, %eps)
                  ~~~~~~~~~~~~ <--- HERE
    return (%y)
RuntimeError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpuumtrxyy/main.c', '-O3', '-I/usr/local/lib/python3.10/dist-packages/triton/common/../third_party/cuda/include', '-I/usr/include/python3.10', '-I/tmp/tmpuumtrxyy', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpuumtrxyy/group_norm_4d_channels_last_forward_collect_stats_kernel.cpython-310-x86_64-linux-gnu.so', '-L/usr/lib/x86_64-linux-gnu']' returned non-zero exit status 1.

At:
/usr/lib/python3.10/subprocess.py(369): check_call
/usr/local/lib/python3.10/dist-packages/triton/common/build.py(90): _build
/usr/local/lib/python3.10/dist-packages/triton/compiler/make_launcher.py(39): make_stub
/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py(425): compile
<string>(63): group_norm_4d_channels_last_forward_collect_stats_kernel
/usr/local/lib/python3.10/dist-packages/sfast/triton/__init__.py(35): new_func
/usr/local/lib/python3.10/dist-packages/triton/runtime/autotuner.py(232): run
/usr/local/lib/python3.10/dist-packages/triton/runtime/autotuner.py(232): run
/usr/local/lib/python3.10/dist-packages/sfast/triton/ops/group_norm.py(437): group_norm_forward
/usr/local/lib/python3.10/dist-packages/sfast/triton/torch_ops.py(186): forward
/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py(539): apply
/usr/local/lib/python3.10/dist-packages/sfast/triton/torch_ops.py(224): group_norm_silu
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/usr/local/lib/python3.10/dist-packages/sfast/jit/trace_helper.py(112): forward
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/usr/local/lib/python3.10/dist-packages/sfast/jit/trace_helper.py(55): wrapper
/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py(75): make_graphed_callable
/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py(46): simple_make_graphed_callable
/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py(29): dynamic_graphed_callable
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py(918): __call__
/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py(115): decorate_context
/testing/examples/optimize_stable_diffusion_pipeline.py(81): <module>

Pip Freeze:
accelerate==0.25.0
annotated-types==0.6.0
anyio==3.7.1
certifi==2023.11.17
charset-normalizer==3.3.2
click==8.1.7
diffusers==0.24.0
exceptiongroup==1.2.0
fastapi==0.104.1
filelock==3.13.1
fsspec==2023.12.0
h11==0.14.0
huggingface-hub==0.19.4
idna==3.6
importlib-metadata==7.0.0
Jinja2==3.1.2
MarkupSafe==2.1.3
mpmath==1.3.0
networkx==3.2.1
numpy==1.26.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
packaging==23.2
Pillow==10.1.0
psutil==5.9.6
pydantic==2.5.2
pydantic_core==2.14.5
PyYAML==6.0.1
regex==2023.10.3
requests==2.31.0
safetensors==0.4.1
sniffio==1.3.0
stable-fast @ https://github.com/chengzeyi/stable-fast/releases/download/v0.0.12.post6/stable_fast-0.0.12.post6+torch210cu121-cp310-cp310-manylinux2014_x86_64.whl
starlette==0.27.0
sympy==1.12
tokenizers==0.15.0
torch==2.1.0
tqdm==4.66.1
transformers==4.35.2
triton==2.1.0
typing_extensions==4.8.0
urllib3==2.1.0
uvicorn==0.24.0.post1
xformers==0.0.22.post7
zipp==3.17.0
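
For context on the failure above: the `Python.h` error means Triton could not find the CPython development headers when compiling its launcher stub with gcc (on Debian/Ubuntu they ship in the python3.10-dev package). A quick hedged check, assuming a standard CPython layout:

```
import os
import sysconfig

# Triton builds a small C stub against the interpreter's include directory;
# if Python.h is missing there, the gcc step fails exactly as above.
include_dir = sysconfig.get_paths()["include"]
print(include_dir, "has Python.h:",
      os.path.exists(os.path.join(include_dir, "Python.h")))
```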

SDNext Support

Hi @chengzeyi, I sent you an email last week but wanted to try again. We'd love to get in touch, since Vlad has now applied your code to SDNext and we're seeing great results on our dev branch.
Hit us up on our Discord server or just reply to my email so we can connect, as we plan on posting our results.

Thanks!

What's the advantage compared to TensorRT?

Thank you for this great work. It's amazing that it can reach almost the same performance as TensorRT on NVIDIA GPUs!

However, is there a convincing reason to use it instead of TensorRT?

Is it compatible with Torch nightly versions?

I cannot build xformers from source; it crashes with OOM errors.

Installing the stable-fast repository also results in errors when running setup.py.

Is there a workaround for this? It runs with torch 2.1 but fails with nightly versions.

RuntimeError for Demo Test

Hi,

Thanks for your work. I am running the demo test, but I hit this issue when running inference after a successful compile:

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):

graph(%1, %2):
    %x : Tensor = sfast_triton::reshape(%1, %2)
                  ~~~~~~~~~~~~ <--- HERE
    return (%x)
RuntimeError: RuntimeError: shape '[512, 512, 64, 64]' is invalid for input of size 2097152

At:
  /opt/sd/lib/python3.10/site-packages/torch/_ops.py(502): __call__
  /opt/sd/lib/python3.10/site-packages/sfast/triton/torch_ops.py(82): forward
  /opt/sd/lib/python3.10/site-packages/torch/autograd/function.py(506): apply
  /opt/sd/lib/python3.10/site-packages/sfast/triton/torch_ops.py(97): reshape
  /opt/sd/lib/python3.10/site-packages/torch/nn/modules/module.py(1501): _call_impl
  /opt/sd/lib/python3.10/site-packages/sfast/jit/trace_helper.py(111): forward
  /opt/sd/lib/python3.10/site-packages/torch/nn/modules/module.py(1501): _call_impl
  /opt/sd/lib/python3.10/site-packages/sfast/jit/trace_helper.py(57): wrapper
  /opt/sd/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py(707): __call__
  /opt/sd/lib/python3.10/site-packages/torch/utils/_contextlib.py(115): decorate_context
  /hostroot/experiments/sfast/test_sfast.py(70): <module>

Would you help take a look? Thanks!

Problem with SDXL compilation

Hi. Thanks for the great repo. I tested it on SD 1.5 and the speed increase is really impressive. However, there are problems with SDXL. I'm trying to run it in Google Colab. Here's what I'm doing:

!pip install -q diffusers transformers accelerate omegaconf ninja
!pip install -q https://download.pytorch.org/whl/cu118/torch-2.1.0%2Bcu118-cp310-cp310-linux_x86_64.whl
!pip install -q https://download.pytorch.org/whl/triton-2.1.0-0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
!pip install -q https://download.pytorch.org/whl/cu118/xformers-0.0.22.post4%2Bcu118-cp310-cp310-manylinux2014_x86_64.whl
!pip install -q https://github.com/camenduru/stable-fast/releases/download/colab/stable_fast-0.0.2-cp310-cp310-linux_x86_64.whl

Next, I init and compile the model:

from sfast.compilers.stable_diffusion_pipeline_compiler import (compile, CompilationConfig)
from diffusers import DiffusionPipeline
import torch, xformers, triton

torch.backends.cuda.matmul.allow_tf32 = True
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to("cuda")

config = CompilationConfig.Default()
config.enable_xformers = True
config.enable_triton = True
config.enable_cuda_graph = True
compiled_pipe = compile(pipe, config)

Finally, I try to run inference:

prompt = 'a car'
h, w = 1024, 1024
steps = 30
seed = 42
guidance_scale = 7
num_images = 1

image = pipe(
    prompt = prompt,
    height = h,
    width = w,
    num_inference_steps = steps,
    guidance_scale = guidance_scale,
    num_images_per_prompt = num_images,
    generator = torch.Generator(device='cuda').manual_seed(seed),
).images

I get the following error:

/usr/local/lib/python3.10/dist-packages/torch/jit/_trace.py in trace_module(mod, inputs, optimize, check_trace, check_inputs, check_tolerance, strict, _force_outplace, _module_class, _compilation_unit, example_inputs_is_kwarg, _store_inputs)
   1063             else:
   1064                 example_inputs = make_tuple(example_inputs)
-> 1065                 module._c._create_method_from_trace(
   1066                     method_name,
   1067                     func,

RuntimeError: Tracer cannot infer type of BaseModelOutputWithPooling(last_hidden_state=tensor([[[-0.3884,  0.0229, -0.0523,  ..., -0.4902, -0.3066,  0.0674],

...

device='cuda:0', dtype=torch.float16)), attentions=None)
:Dictionary inputs to traced functions must have consistent type. Found Tensor and Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]

Distribute pre-compiled wheels

Since the build setup is very particular and computationally intensive (and also requires cuDNN/cuBLAS on the target machine), would it make sense to distribute/attach wheels with releases?

Potential improvements to stable-fast

Hi chengzeyi,
I wanted to first congratulate you on this awesome work! I have actually been working on a similar project here, but I recently stopped development since your project has already been widely adopted. However, there are some features I have been working on that I believe could enhance stable-fast:

  • Using torch.fx to rewrite and accelerate the UNet (see the sketch after this list)
  • Supporting tensor parallelism for people with more than one GPU
  • INT4 quantization with GPTQ
  • Sparse inference

I would love to work with you on these topics if possible, since I have already partially implemented quite a few of them! Please let me know if you could see us collaborating in the future.
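
A minimal torch.fx sketch of the kind of graph rewriting meant above (the toy module and the GELU swap are illustrative assumptions, not stable-fast code or the linked project's code):

```
import torch
import torch.fx as fx


class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.nn.functional.gelu(self.proj(x))


# Trace the module into an fx graph, then rewrite the exact GELU into the
# cheaper tanh approximation as a stand-in for a real fused-kernel swap.
gm = fx.symbolic_trace(Block())
for node in gm.graph.nodes:
    if node.op == "call_function" and node.target is torch.nn.functional.gelu:
        node.kwargs = dict(node.kwargs, approximate="tanh")
gm.recompile()
print(gm(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```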

RuntimeError: Freezing is currently only implemented for modules in eval mode

Any ideas? I've found barely any mention of what freezing is or does.
It didn't do this with the runwayml/stable-diffusion-v1-5 model in the example.
Is the .ckpt file format required?

```
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing safety_checker=None. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at huggingface/diffusers#254 .

/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/diffusers/configuration_utils.py:134: FutureWarning: Accessing config attribute requires_safety_checker directly via 'StableDiffusionPipeline' object attribute is deprecated. Please access 'requires_safety_checker' over 'StableDiffusionPipeline's config object instead, e.g. 'scheduler.config.requires_safety_checker'.
deprecate("direct config name access", "1.0.0", deprecation_message, standard_warn=False)
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py:750: FutureWarning: torch_dtype is deprecated and will be removed in version 0.25.0.
deprecate("torch_dtype", "0.25.0", "")
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:157: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
obj_type = tensors[start].item()
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:216: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:226: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:212: TracerWarning: Converting a tensor to a Python list might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bytes(tensors[start].tolist()), start + 1
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:66: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if input_shape[-1] > 1 or self.sliding_window is not None:
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:137: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if past_key_values_length > 0:
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:273: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:281: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:313: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:21: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return torch.tensor([num], dtype=torch.int64)
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:251: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return torch.tensor([obj_id], dtype=torch.int64)
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:121: TracerWarning: torch.as_tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return (torch.as_tensor(tuple(obj), dtype=torch.uint8), )
Traceback (most recent call last):
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/stable-fast-exp.py", line 73, in
output_image = compiled_model(**kwarg_inputs).images[0]
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 756, in call
prompt_embeds, negative_prompt_embeds = self.encode_prompt(
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 352, in encode_prompt
prompt_embeds = self.text_encoder(text_input_ids.to(device), attention_mask=attention_mask)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 51, in wrapper
traced_m = ts_compiler(traced_m, call_helper, args,
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/compilers/stable_diffusion_pipeline_compiler.py", line 82, in ts_compiler
m = jit_utils.better_freeze(m)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/jit/utils.py", line 35, in better_freeze
freezed_module = torch.jit.freeze(script_module, *args, **kwargs)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/jit/_freeze.py", line 108, in freeze
raise RuntimeError(
RuntimeError: Freezing is currently only implemented for modules in eval mode. Please call .eval() on your module before freezing.```

Does it support LCM LoRA? The generated images are very poor

I used stable-fast==0.0.13.post3 to test an LCM LoRA, and the result looks like this:

[image: output with LCM LoRA under stable-fast]

But using the same LCM LoRA in pure diffusers is fine:

[image: output with LCM LoRA in plain diffusers]

My code is like this:

import torch
from diffusers import LCMScheduler, AutoPipelineForText2Image, DiffusionPipeline
from sfast.compilers.stable_diffusion_pipeline_compiler import (
    compile, CompilationConfig)
import numpy as np
from PIL import Image

base_model_path = "runwayml/stable-diffusion-v1-5"
lcm_path = "latent-consistency/lcm-lora-sdv1-5"


def load_model():
    model = DiffusionPipeline.from_pretrained(base_model_path,
                                              torch_dtype=torch.float16,
                                              safety_checker=None,
                                              use_safetensors=True)

    model.scheduler = LCMScheduler.from_config(model.scheduler.config)
    model.safety_checker = None
    model.to(torch.device('cuda'))
    #model.unet.load_attn_procs(lcm_path)
    model.load_lora_weights(lcm_path)
    model.fuse_lora()
    return model


def compile_model(model):
    config = CompilationConfig.Default()

    # xformers and Triton are suggested for achieving best performance.
    # It might be slow for Triton to generate, compile and fine-tune kernels.
    try:
        import xformers
        config.enable_xformers = True
    except ImportError:
        print('xformers not installed, skip')
    # NOTE:
    # When GPU VRAM is insufficient or the architecture is too old, Triton might be slow.
    # Disable Triton if you encounter this problem.
    try:
        import triton
        config.enable_triton = True
    except ImportError:
        print('Triton not installed, skip')
    # NOTE:
    # CUDA Graph is suggested for small batch sizes and small resolutions to reduce CPU overhead.
    # My implementation can handle dynamic shape with increased need for GPU memory.
    # But when your GPU VRAM is insufficient or the image resolution is high,
    # CUDA Graph could cause less efficient VRAM utilization and slow down the inference,
    # especially when on Windows or WSL which has the "shared VRAM" mechanism.
    # If you meet problems related to it, you should disable it.
    config.enable_cuda_graph = True

    model = compile(model, config)
    return model


def main():
    prompt = "a rendering of a living room with a couch and a tv"
    negative_prompt = "ugly,logo,pixelated,lowres,text,word,cropped,low quality,normal quality,username,watermark,signature,blurry,soft,NSFW,painting,cartoon,hang,occluded objects,Fisheye View"

    model = load_model()
    model = compile_model(model)

    kwarg_inputs = dict(
        prompt=prompt,
        negative_prompt=negative_prompt,
        width=768,
        height=512,
        num_inference_steps=7,
        num_images_per_prompt=1,
        guidance_scale=1.5,
    )

    # NOTE: Warm it up.
    # The initial calls will trigger compilation and might be very slow.
    # After that, it should be very fast.
    for _ in range(3):
        output_image = model(**kwarg_inputs).images[0]

    # Let's see it!
    # Note: Progress bar might work incorrectly due to the async nature of CUDA.

    img_total = []
    for i in range(2):
        output_image = model(
            prompt=prompt,
            negative_prompt=negative_prompt,
            width=768,
            height=512,
            num_inference_steps=7,
            num_images_per_prompt=6,
            # generator=generators
        ).images

        img_row = []
        for img in output_image:
            img_row.append(np.asarray(img))
        img = np.hstack(img_row)
        img_total.append(img)
    image = np.vstack(img_total)
    # cv2.putText(image,prompt,(40,50),cv2.FONT_HERSHEY_SIMPLEX,2,(0,0,255),3)

    image = Image.fromarray(image)
    image.save("./output_lcm.png")


if __name__ == '__main__':
    main()

How to input fixed latents to the model during inference

Hi, I would like to know how to pass fixed latents to the model during inference. I need fixed latents (instead of fresh random noise) to compare the results of stable-fast and plain PyTorch. I have set the input parameters as follows, but it doesn't work:

latents = np.load('latents.npy', mmap_mode=None, allow_pickle=False, fix_imports=True, encoding='ASCII')
latents = torch.from_numpy(latents).half().cuda()

kwarg_inputs = dict(
    prompt=prompt,
    latents=latents,
    height=512,
    width=512,
    num_inference_steps=20,
    num_images_per_prompt=1,
)
output_image = compiled_model(**kwarg_inputs).images[0]
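
For comparison runs, a hedged sketch of building the fixed latents directly, assuming the standard SD 1.5 latent layout of (batch, 4, height // 8, width // 8); `model` and `compiled_model` stand for the plain and compiled pipelines from the snippet above:

```
import torch

# Same seeded noise for both pipelines; match the pipelines' dtype and device.
latents = torch.randn(
    1, 4, 512 // 8, 512 // 8,
    generator=torch.Generator("cpu").manual_seed(42),
).half().cuda()

ref_image = model(prompt=prompt, latents=latents.clone(),
                  height=512, width=512, num_inference_steps=20).images[0]
fast_image = compiled_model(prompt=prompt, latents=latents.clone(),
                            height=512, width=512, num_inference_steps=20).images[0]
```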

SDXL LoRA Swap Issue

Hey,

I am trying to swap the LoRA of the compiled model with the sample code given in the README, and I get this error:

[image: error screenshot]

When I try to replace the weights myself, I get very bad outputs.

My snippet for trying to swap weights:

state_dict = pipe.unet.state_dict()
pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy_face")
update_state_dict(state_dict, pipe.unet.state_dict())
pipe.unet.load_state_dict(state_dict, assign=True)

Then infer.
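
For reference, a hedged diffusers-level sketch of one swap approach (this is not the README recipe; `unfuse_lora`, `unload_lora_weights` and `fuse_lora` are standard diffusers calls, and whether the compiled UNet picks the new weights up cleanly under CUDA Graph is an assumption to verify):

```
# Remove the old adapter's fused deltas before applying the new one, then
# fuse again so the traced UNet sees plain fp16 weights.
pipe.unfuse_lora()
pipe.unload_lora_weights()
pipe.load_lora_weights("CiroN2022/toy-face",
                       weight_name="toy_face_sdxl.safetensors")
pipe.fuse_lora()
image = pipe(prompt="toy_face of a hacker with a hoodie").images[0]
```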

Compilation errors

Trying to compile on my machine results in:

FAILED: /home/dirkson/ai/depend/stable-fast/build/temp.linux-x86_64-cpython-311/home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/dirkson/ai/depend/stable-fast/build/temp.linux-x86_64-cpython-311/home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o.d -DWITH_CUDA -I/home/dirkson/ai/depend/stable-fast/sfast/csrc -I/home/dirkson/ai/lib/python3.11/site-packages/torch/include -I/home/dirkson/ai/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/home/dirkson/ai/lib/python3.11/site-packages/torch/include/TH -I/home/dirkson/ai/lib/python3.11/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/dirkson/ai/include -I/usr/include/python3.11 -c -c /home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu -o /home/dirkson/ai/depend/stable-fast/build/temp.linux-x86_64-cpython-311/home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --extended-lambda -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -std=c++17

/home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu(1210): error: no suitable user-defined conversion from "at::IntArrayRef" to "c10::SymIntArrayRef" exists
input_r, weight_r, bias_opt, stride_, fromIntArrayRefUnchecked(padding_),
^

/home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu(1211): error: no suitable user-defined conversion from "at::IntArrayRef" to "c10::SymIntArrayRef" exists
dilation_, transposed_, fromIntArrayRefUnchecked(output_padding_),

My stack is fairly up to date, more so than standard, I think: Python 3.11.5 with PyTorch nightlies. nvcc/CUDA versioning:

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Let me know if there's other versioning info that might help! My knowledge of C++ isn't good enough for me to really grok this error message; otherwise you'd be getting a PR rather than an issue. Sorry!

Cheers!

Stable Video Optimizations?

Do you think any of these optimizations could be applied to Stable Video Diffusion? I'd like to help here if possible.
