
stable-fast's Issues

[composability] stable-fast + sd-turbo device mismatch

2023-12-07 17:35:10.128 [stderr   ]   File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2023-12-07 17:35:10.128 [stderr   ]     return func(*args, **kwargs)
2023-12-07 17:35:10.128 [stderr   ]   File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py", line 926, in __call__
2023-12-07 17:35:10.128 [stderr   ]     image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[
2023-12-07 17:35:10.128 [stderr   ]   File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 36, in dynamic_graphed_callable
2023-12-07 17:35:10.128 [stderr   ]     cached_callable = simple_make_graphed_callable(
2023-12-07 17:35:10.128 [stderr   ]   File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 53, in simple_make_graphed_callable
2023-12-07 17:35:10.128 [stderr   ]     return make_graphed_callable(callable,
2023-12-07 17:35:10.128 [stderr   ]   File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 98, in make_graphed_callable
2023-12-07 17:35:10.129 [stderr   ]     static_inputs = shadow_copy(static_inputs_)
2023-12-07 17:35:10.129 [stderr   ]   File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/utils/copy.py", line 49, in shadow_copy
2023-12-07 17:35:10.129 [stderr   ]     return type(obj)(shadow_copy(x, detach=detach) for x in obj)
2023-12-07 17:35:10.129 [stderr   ]   File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/utils/copy.py", line 49, in <genexpr>
2023-12-07 17:35:10.129 [stderr   ]     return type(obj)(shadow_copy(x, detach=detach) for x in obj)
2023-12-07 17:35:10.129 [stderr   ]   File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/utils/copy.py", line 45, in shadow_copy
2023-12-07 17:35:10.129 [stderr   ]     return sfast._C._create_shadow_tensor(
2023-12-07 17:35:10.129 [stderr   ] RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.

Exception when using deterministic generation with enable_cuda_graph = True

When doing deterministic generation by passing a torch.Generator to the pipeline, stable-fast raises an AssertionError:

Traceback (most recent call last):
  File "/data/dbgsfast/dbgsfast/__init__.py", line 43, in <module>
    model(**kwarg_inputs, generator=torch.Generator(device="cuda").manual_seed(41))
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 958, in __call__
    image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 32, in dynamic_graphed_callable
    return cached_callable(*args, **kwargs)
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 143, in functionalized
    return _graphed_module(*user_args, **user_kwarg_args)
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 130, in forward
    outputs = self._forward(*inputs, **kwarg_inputs)
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 136, in _forward
    tree_copy_(static_kwarg_inputs, kwarg_inputs)
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/utils/copy.py", line 20, in tree_copy_
    tree_copy_(dest[k], src[k])
  File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/utils/copy.py", line 22, in tree_copy_
    assert dest == src

Code to reproduce:

import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler
from sfast.compilers.stable_diffusion_pipeline_compiler import (
    compile,
    CompilationConfig,
)


def load_model():
    model = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )

    model.scheduler = EulerAncestralDiscreteScheduler.from_config(
        model.scheduler.config
    )
    model.safety_checker = None
    model.to(torch.device("cuda"))
    return model


model = load_model()

config = CompilationConfig.Default()
config.enable_cuda_graph = True

model = compile(model, config)

kwarg_inputs = dict(
    prompt="(masterpiece:1,2), best quality, masterpiece, best detail face, a beautiful girl",
    num_inference_steps=30,
    num_images_per_prompt=1,
    height=512,
    width=512,
)

model(**kwarg_inputs, generator=torch.Generator(device="cuda").manual_seed(42))

model(**kwarg_inputs, generator=torch.Generator(device="cuda").manual_seed(41))
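A hedged workaround sketch, based only on the fact that the assertion fires inside the CUDA-graph input-copying path (`tree_copy_`) when a fresh `torch.Generator` is passed on each call: disabling CUDA graphs avoids that path entirely, at the cost of the CPU-overhead reduction they provide. This is an assumption, not a confirmed fix.

```python
# Hypothetical workaround: skip CUDA graphs so per-call generator objects
# are never copied into static graph inputs.
config = CompilationConfig.Default()
config.enable_cuda_graph = False  # assumption: bypasses the failing tree_copy_ assert
model = compile(model, config)
```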

[Enhancement] Contribution to A100 numbers

Hey!

I have access to A100s and would like to submit results from an A100. Let me know if you have a benchmark script up and ready that I can use to run and get the numbers.

Compatible with ComfyUI?

Hi! I saw your message about your model. What is your model based on? Do you have a version compatible with ComfyUI or Draw Things on Mac? Can it run on Mac MPS?

RuntimeError: Freezing is currently only implemented for modules in eval mode

Any ideas? I've found barely any mention of what freezing is or does.
It didn't do this with the runwayml/stable-diffusion-v1-5 model in the example.
Is the .ckpt file format required?

```
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing safety_checker=None. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at huggingface/diffusers#254 .

/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/diffusers/configuration_utils.py:134: FutureWarning: Accessing config attribute requires_safety_checker directly via 'StableDiffusionPipeline' object attribute is deprecated. Please access 'requires_safety_checker' over 'StableDiffusionPipeline's config object instead, e.g. 'scheduler.config.requires_safety_checker'.
deprecate("direct config name access", "1.0.0", deprecation_message, standard_warn=False)
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py:750: FutureWarning: torch_dtype is deprecated and will be removed in version 0.25.0.
deprecate("torch_dtype", "0.25.0", "")
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:157: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
obj_type = tensors[start].item()
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:216: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:226: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:212: TracerWarning: Converting a tensor to a Python list might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bytes(tensors[start].tolist()), start + 1
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:66: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if input_shape[-1] > 1 or self.sliding_window is not None:
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:137: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if past_key_values_length > 0:
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:273: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:281: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:313: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:21: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return torch.tensor([num], dtype=torch.int64)
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:251: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return torch.tensor([obj_id], dtype=torch.int64)
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:121: TracerWarning: torch.as_tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return (torch.as_tensor(tuple(obj), dtype=torch.uint8), )
Traceback (most recent call last):
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/stable-fast-exp.py", line 73, in
output_image = compiled_model(**kwarg_inputs).images[0]
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 756, in call
prompt_embeds, negative_prompt_embeds = self.encode_prompt(
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 352, in encode_prompt
prompt_embeds = self.text_encoder(text_input_ids.to(device), attention_mask=attention_mask)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 51, in wrapper
traced_m = ts_compiler(traced_m, call_helper, args,
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/compilers/stable_diffusion_pipeline_compiler.py", line 82, in ts_compiler
m = jit_utils.better_freeze(m)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/jit/utils.py", line 35, in better_freeze
freezed_module = torch.jit.freeze(script_module, *args, **kwargs)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/jit/_freeze.py", line 108, in freeze
raise RuntimeError(
RuntimeError: Freezing is currently only implemented for modules in eval mode. Please call .eval() on your module before freezing.
```
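The final line of the traceback suggests the likely fix: torch.jit.freeze only accepts eval-mode modules. A minimal sketch, assuming the pipeline object is named `model` as in the other examples here (the submodule list is an assumption; single-file `.ckpt` loads may leave modules in training mode):

```python
# Put the submodules stable-fast traces into eval mode before compiling,
# since torch.jit.freeze refuses training-mode modules.
model.unet.eval()
model.vae.eval()
model.text_encoder.eval()
compiled_model = compile(model, config)
```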

What's the advantage compared to TensorRT?

Thank you for this great work. It's amazing that it can reach almost the same performance as TensorRT on NVIDIA GPUs!

However, is there a convincing reason to use it instead of TensorRT?

Trying to get it working on Windows

I've been trying to get it to work on Windows, but I'm currently stuck on what seems to be an import error for a DLL file. Does anyone know how to fix this, or have tips for getting it working on Windows?

Traceback (most recent call last):
  File "C:\Users\USER\ComfyUI_windows_portable\ComfyUI\nodes.py", line 1800, in load_custom_node
    module_spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "C:\Users\USER\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_stable_fast\__init__.py", line 1, in <module>    from .node import ApplyStableFastUnet
  File "C:\Users\USER\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_stable_fast\node.py", line 2, in <module>
    from sfast.compilers.stable_diffusion_pipeline_compiler import CompilationConfig
  File "C:\Users\USER\ComfyUI_windows_portable\python_embeded\Lib\site-packages\sfast\__init__.py", line 23, in <module>
    import sfast._C as _C
ImportError: DLL load failed while importing _C: The specified module could not be found.

Cannot import C:\Users\USER\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_stable_fast module for custom nodes: DLL load failed while importing _C: The specified module could not be found.
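`sfast._C` is a compiled extension linked against a specific torch/CUDA build, so one thing to rule out first (an assumption about the cause, not a confirmed one) is a wheel/torch version mismatch. A quick diagnostic sketch:

```python
# Print the local torch build, then attempt the exact import that fails above.
import torch

print(torch.__version__)   # e.g. '2.1.0+cu121' -- should match the stable-fast wheel
print(torch.version.cuda)  # e.g. '12.1'

import sfast._C  # raises ImportError if the matching CUDA/torch DLLs are absent
```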

Tiny AutoEncoder support

If we're going for raw speed, autoencoder_tiny should present a massive speedup. https://huggingface.co/docs/diffusers/api/models/autoencoder_tiny

I'm not sure exactly how this works, but it seems like the issue is here:

m.vae.quant_conv.forward = lazy_trace_(m.vae.quant_conv.forward)

Commenting this line out allows my models to compile with TAE and provides approximately a 50ms speedup on my RTX 3090.
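Rather than deleting the line, a guard might be the safer fix, on the assumption that AutoencoderTiny simply has no `quant_conv` module to trace (a sketch against stable-fast's internals, not a tested patch):

```python
# Hypothetical guard in the compiler: only trace quant_conv when the VAE has
# one, so TAESD/AutoencoderTiny pipelines can still be compiled.
if getattr(m.vae, "quant_conv", None) is not None:
    m.vae.quant_conv.forward = lazy_trace_(m.vae.quant_conv.forward)
```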

Compilation errors

Trying to compile on my machine results in:

FAILED: /home/dirkson/ai/depend/stable-fast/build/temp.linux-x86_64-cpython-311/home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/dirkson/ai/depend/stable-fast/build/temp.linux-x86_64-cpython-311/home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o.d -DWITH_CUDA -I/home/dirkson/ai/depend/stable-fast/sfast/csrc -I/home/dirkson/ai/lib/python3.11/site-packages/torch/include -I/home/dirkson/ai/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/home/dirkson/ai/lib/python3.11/site-packages/torch/include/TH -I/home/dirkson/ai/lib/python3.11/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/dirkson/ai/include -I/usr/include/python3.11 -c -c /home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu -o /home/dirkson/ai/depend/stable-fast/build/temp.linux-x86_64-cpython-311/home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --extended-lambda -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -std=c++17

/home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu(1210): error: no suitable user-defined conversion from "at::IntArrayRef" to "c10::SymIntArrayRef" exists
input_r, weight_r, bias_opt, stride_, fromIntArrayRefUnchecked(padding_),
^

/home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu(1211): error: no suitable user-defined conversion from "at::IntArrayRef" to "c10::SymIntArrayRef" exists
dilation_, transposed_, fromIntArrayRefUnchecked(output_padding_),

My stack is fairly up-to-date, more so than standard, I think: Python 3.11.5 with PyTorch nightlies. nvcc/CUDA versioning:

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Let me know if there's other versioning info that might help! My knowledge of C++ isn't good enough to really grok this error message; otherwise you'd be getting a PR rather than an issue. Sorry!

Cheers!

SDXL LoRA swap issue

Hey

I am trying to swap the LoRA of the compiled model with the sample code given in the README, and I get this error:

[image: error screenshot]

When I try to replace the weights myself, I get very bad outputs.

My snippet for swapping weights:

state_dict = pipe.unet.state_dict()
pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy_face")
update_state_dict(state_dict, pipe.unet.state_dict())
pipe.unet.load_state_dict(state_dict, assign=True)

Then infer.
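For context, a minimal version of the `update_state_dict` helper the snippet relies on, following the in-place-copy pattern from the project's README (the exact helper there may differ):

```python
def update_state_dict(dst, src):
    for key, value in src.items():
        # Copy in place: the traced graph holds references to the original
        # parameter tensors, so mutating them is what makes the swap visible.
        dst[key].copy_(value)
```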

Update torch.compile benchmark on A100 40GB SDv1.5 for torch nightly

100%|██████████| 50/50 [00:00<00:00, 58.00it/s]

58 Iterations per second. (with torch.cuda.synchronize)


Settings:

A100 40GB, fp16, batch size 1
height=512,
width=512,
num_inference_steps=50,
num_images_per_prompt=1,

Wall-clock time for full pipeline: 862ms (no torch.cuda.synchronize), 927ms (with torch.cuda.synchronize).

torch version: torch==2.2.0.dev20231203+cu121

Also, please update wall-clock time and iterations/s for other GPUs on torch nightly. If extrapolation serves us right, it ought to be faster than TensorRT.
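For anyone reproducing these numbers, a minimal sketch of the synchronized wall-clock measurement (assuming `model` and `kwarg_inputs` as in the repro scripts earlier on this page):

```python
import time

import torch

torch.cuda.synchronize()              # drain queued CUDA work before timing
begin = time.perf_counter()
model(**kwarg_inputs)
torch.cuda.synchronize()              # wait for all kernels to finish
print(f"wall clock: {(time.perf_counter() - begin) * 1000:.0f} ms")
```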


Outdated
2023-12-04 03:24:33.549 [stderr   ] 100%|██████████| 50/50 [00:00<00:00, 79.00it/s]

79 Iterations per second. (naive measurement)

Wall clock time for full pipeline: 862ms.


torch eager and not channels-first: wall-clock time: 1763ms

100%|██████████| 50/50 [00:01<00:00, 29.55it/s]

(naive measurement)

Distribute pre-compiled wheels

Since the setup is very peculiar and computationally intensive (and also requires cudnn/cublas on the target machine), would it make sense to distribute/attach wheels with releases?

Does it support LCM LoRA? The generated images are very poor

I used stable-fast==0.0.13.post3 to test an LCM LoRA; the result looks like this:

[image: output_lcm]

But using the LCM LoRA in pure diffusers is fine:

[image: output2]

My code looks like this:

import torch
from diffusers import LCMScheduler, AutoPipelineForText2Image, DiffusionPipeline
from sfast.compilers.stable_diffusion_pipeline_compiler import (
    compile, CompilationConfig)
import numpy as np
from PIL import Image

base_model_path = "runwayml/stable-diffusion-v1-5"
lcm_path = "latent-consistency/lcm-lora-sdv1-5"


def load_model():
    model = DiffusionPipeline.from_pretrained(base_model_path,
                                              torch_dtype=torch.float16,
                                              safety_checker=None,
                                              use_safetensors=True)

    model.scheduler = LCMScheduler.from_config(model.scheduler.config)
    model.safety_checker = None
    model.to(torch.device('cuda'))
    #model.unet.load_attn_procs(lcm_path)
    model.load_lora_weights(lcm_path)
    model.fuse_lora()
    return model


def compile_model(model):
    config = CompilationConfig.Default()

    # xformers and Triton are suggested for achieving best performance.
    # It might be slow for Triton to generate, compile and fine-tune kernels.
    try:
        import xformers
        config.enable_xformers = True
    except ImportError:
        print('xformers not installed, skip')
    # NOTE:
    # When GPU VRAM is insufficient or the architecture is too old, Triton might be slow.
    # Disable Triton if you encounter this problem.
    try:
        import triton
        config.enable_triton = True
    except ImportError:
        print('Triton not installed, skip')
    # NOTE:
    # CUDA Graph is suggested for small batch sizes and small resolutions to reduce CPU overhead.
    # My implementation can handle dynamic shape with increased need for GPU memory.
    # But when your GPU VRAM is insufficient or the image resolution is high,
    # CUDA Graph could cause less efficient VRAM utilization and slow down the inference,
    # especially when on Windows or WSL which has the "shared VRAM" mechanism.
    # If you meet problems related to it, you should disable it.
    config.enable_cuda_graph = True

    model = compile(model, config)
    return model


def main():
    prompt = "a rendering of a living room with a couch and a tv"
    negative_prompt = "ugly,logo,pixelated,lowres,text,word,cropped,low quality,normal quality,username,watermark,signature,blurry,soft,NSFW,painting,cartoon,hang,occluded objects,Fisheye View"

    model = load_model()
    model = compile_model(model)

    kwarg_inputs = dict(
        prompt=prompt,
        negative_prompt=negative_prompt,
        width=768,
        height=512,
        num_inference_steps=7,
        num_images_per_prompt=1,
        guidance_scale=1.5,
    )

    # NOTE: Warm it up.
    # The initial calls will trigger compilation and might be very slow.
    # After that, it should be very fast.
    for _ in range(3):
        output_image = model(**kwarg_inputs).images[0]

    # Let's see it!
    # Note: Progress bar might work incorrectly due to the async nature of CUDA.

    img_total = []
    for i in range(2):
        output_image = model(
            prompt=prompt,
            negative_prompt=negative_prompt,
            width=768,
            height=512,
            num_inference_steps=7,
            num_images_per_prompt=6,
            # generator=generators
        ).images

        img_row = []
        for img in output_image:
            img_row.append(np.asarray(img))
        img = np.hstack(img_row)
        img_total.append(img)
    image = np.vstack(img_total)
    # cv2.putText(image,prompt,(40,50),cv2.FONT_HERSHEY_SIMPLEX,2,(0,0,255),3)

    image = Image.fromarray(image)
    image.save("./output_lcm.png")


if __name__ == '__main__':
    main()

Invert FORCE_CUDA=1, or make install-from-source fail if CUDA extensions can't be built

So I was debugging why the deployed model's inference was slower than in a Jupyter notebook.
The reason was that the CI building the container image did not build the CUDA extensions (because of this).

It would probably be better to invert this logic and throw an error if it cannot build the CUDA extensions.
Otherwise many users might not understand why their performance is worse than what should be possible.

This could be bypassed with an env var like OPTIONAL_CUDA=1 or DISABLE_CUDA=1.
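A sketch of what the inverted logic could look like in setup.py (hypothetical names; `DISABLE_CUDA` is the opt-out proposed above, and `CUDA_HOME` is how torch's extension tooling reports a usable toolkit):

```python
import os

from torch.utils.cpp_extension import CUDA_HOME

force_cuda = os.environ.get("FORCE_CUDA") == "1"
disable_cuda = os.environ.get("DISABLE_CUDA") == "1"

if not disable_cuda and not force_cuda and CUDA_HOME is None:
    # Fail loudly instead of silently producing a slow, extension-less build.
    raise RuntimeError(
        "No CUDA toolkit found, so the CUDA extensions cannot be built. "
        "Set DISABLE_CUDA=1 to build without them on purpose."
    )
```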

OpenAI Triton conv kernel can't run correctly

I'm running the Triton version of conv from the repository under Triton version 2.1.0. Single-step debugging revealed that the problem seems to be that the conv kernel can't be launched: once the program reaches the part that launches the kernel, it is automatically killed. What is the problem?

[image: 20231120-113647]

Is it compatible with Torch Nightly versions?

I cannot build xformers from source; it crashes with OOM errors.

Installing the stable-fast repository also results in errors when running setup.py.

Is there a workaround for this? It runs with torch 2.1 but fails with nightly versions.

Problem with SDXL compilation

Hi. Thanks for the great repo. I tested it on SD1.5 and the speed increase is really impressive. However, there are problems with SDXL. I'm trying to run it in Google Colab. Here's what I'm doing:

!pip install -q diffusers transformers accelerate omegaconf ninja
!pip install -q https://download.pytorch.org/whl/cu118/torch-2.1.0%2Bcu118-cp310-cp310-linux_x86_64.whl
!pip install -q https://download.pytorch.org/whl/triton-2.1.0-0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
!pip install -q https://download.pytorch.org/whl/cu118/xformers-0.0.22.post4%2Bcu118-cp310-cp310-manylinux2014_x86_64.whl
!pip install -q https://github.com/camenduru/stable-fast/releases/download/colab/stable_fast-0.0.2-cp310-cp310-linux_x86_64.whl

Next, I init and compile the model:

from sfast.compilers.stable_diffusion_pipeline_compiler import (compile, CompilationConfig)
from diffusers import DiffusionPipeline
import torch, xformers, triton

torch.backends.cuda.matmul.allow_tf32 = True
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to("cuda")

config = CompilationConfig.Default()
config.enable_xformers = True
config.enable_triton = True
config.enable_cuda_graph = True
compiled_pipe = compile(pipe, config)

Finally, I try to run inference:

prompt = 'a car'
h, w = 1024, 1024
steps = 30
seed = 42
guidance_scale = 7
num_images = 1

image = pipe(
    prompt = prompt,
    height = h,
    width = w,
    num_inference_steps = steps,
    guidance_scale = guidance_scale,
    num_images_per_prompt = num_images,
    generator = torch.Generator(device='cuda').manual_seed(seed),
).images

I get the following error:

/usr/local/lib/python3.10/dist-packages/torch/jit/_trace.py in trace_module(mod, inputs, optimize, check_trace, check_inputs, check_tolerance, strict, _force_outplace, _module_class, _compilation_unit, example_inputs_is_kwarg, _store_inputs)
   1063             else:
   1064                 example_inputs = make_tuple(example_inputs)
-> 1065                 module._c._create_method_from_trace(
   1066                     method_name,
   1067                     func,

RuntimeError: Tracer cannot infer type of BaseModelOutputWithPooling(last_hidden_state=tensor([[[-0.3884,  0.0229, -0.0523,  ..., -0.4902, -0.3066,  0.0674],

...

device='cuda:0', dtype=torch.float16)), attentions=None)
:Dictionary inputs to traced functions must have consistent type. Found Tensor and Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]

ModuleNotFoundError: No module named 'sfast._C'

Traceback (most recent call last):
  File "/workspace/webui-pakage/stable-fast-0.0.2/predict.py", line 3, in <module>
    from sfast.compilers.stable_diffusion_pipeline_compiler import (compile,
  File "/workspace/webui-pakage/stable-fast-0.0.2/sfast/__init__.py", line 22, in <module>
    import sfast._C as _C
ModuleNotFoundError: No module named 'sfast._C'

How can I solve this?

env:
A100
xformers 0.0.22.post4+cu118
triton 2.1.0
torch 2.1.0+cu118
transformers 4.34.1
ninja 1.11.1.1
diffusers 0.21.4
stable-fast 0.0.2

Can't get it running... can anyone help please? RTX 3090

Hi, I am trying to run python3 optimize_stable_diffusion_pipeline.py
and I get this nasty error where I can't really tell what exactly is wrong, as it refers to just about everything included here.

I am pretty sure I have the correct version of stable-fast, the one corresponding to my Python 3.10, CUDA 12.1 and torch 2.1; gcc is 11.4.
My system is an RTX 3090, i7, etc., a quite clean install of Ubuntu. Nvidia drivers are 530.

this is my pip3 list:

pip3 list
Package Version


accelerate 0.25.0
antlr4-python3-runtime 4.9.3
apturl 0.5.2
bcrypt 3.2.0
blinker 1.4
Brlapi 0.8.3
certifi 2020.6.20
chardet 4.0.0
click 8.0.3
colorama 0.4.4
command-not-found 0.3
cryptography 3.4.8
cupshelpers 1.0
dbus-python 1.2.18
defer 1.0.6
diffusers 0.24.0
distro 1.7.0
distro-info 1.1+ubuntu0.1
duplicity 0.8.21
fasteners 0.14.1
filelock 3.13.1
fsspec 2023.12.1
future 0.18.2
httplib2 0.20.2
huggingface-hub 0.19.4
idna 3.3
importlib-metadata 4.6.4
jeepney 0.7.1
Jinja2 3.1.2
keyring 23.5.0
language-selector 0.1
launchpadlib 1.10.16
lazr.restfulclient 0.14.4
lazr.uri 1.0.6
lockfile 0.12.2
louis 3.20.0
macaroonbakery 1.3.1
Mako 1.1.3
MarkupSafe 2.0.1
monotonic 1.6
more-itertools 8.10.0
mpmath 1.3.0
netifaces 0.11.0
networkx 3.2.1
numpy 1.26.2
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
oauthlib 3.2.0
olefile 0.46
omegaconf 2.3.0
packaging 23.2
paramiko 2.9.3
pexpect 4.8.0
Pillow 9.0.1
pip 22.0.2
protobuf 3.12.4
psutil 5.9.6
ptyprocess 0.7.0
pycairo 1.20.1
pycups 2.0.1
PyGObject 3.42.1
PyJWT 2.3.0
pymacaroons 0.13.0
PyNaCl 1.5.0
pyparsing 2.4.7
PyQt5 5.15.10
PyQt5-Qt5 5.15.2
PyQt5-sip 12.13.0
pyRFC3339 1.1
python-apt 2.4.0+ubuntu2
python-dateutil 2.8.1
python-debian 0.1.43+ubuntu1.1
pytz 2022.1
pyxdg 0.27
PyYAML 5.4.1
regex 2023.10.3
reportlab 3.6.8
requests 2.25.1
safetensors 0.4.1
screen-resolution-extra 0.0.0
SecretStorage 3.3.1
setuptools 59.6.0
six 1.16.0
ssh-import-id 5.11
stable-fast 0.0.13.post3+torch210cu121
sympy 1.12
systemd-python 234
tokenizers 0.15.0
torch 2.1.0
torchvision 0.16.0
tqdm 4.66.1
transformers 4.35.2
triton 2.1.0
typing_extensions 4.8.0
ubuntu-advantage-tools 8001
ubuntu-drivers-common 0.0.0
ufw 0.36.1
unattended-upgrades 0.1
urllib3 1.26.5
usb-creator 0.3.7
wadllib 1.3.6
wheel 0.37.1
xdg 5
xformers 0.0.22.post7
xkit 0.0.0
zipp 1.0.0

and the error is below:

Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00, 6.96it/s]
/home/sd/.local/lib/python3.10/site-packages/torch/cuda/graphs.py:88: UserWarning: The CUDA Graph is empty. This usually means that the graph was attempted to be captured on wrong device or stream. (Triggered internally at ../aten/src/ATen/cuda/CUDAGraph.cpp:192.)
super().capture_end()
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:159: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
obj_type = tensors[start].item()
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:218: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:228: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:214: TracerWarning: Converting a tensor to a Python list might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bytes(tensors[start].tolist()), start + 1
/home/sd/.local/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:66: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if input_shape[-1] > 1 or self.sliding_window is not None:
/home/sd/.local/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:137: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if past_key_values_length > 0:
/home/sd/.local/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:273: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/home/sd/.local/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:281: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
/home/sd/.local/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:313: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:23: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return torch.tensor([num], dtype=torch.int64)
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:253: TracerWarning: torch.Tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return super().new(cls, x, *args, **kwargs)
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:123: TracerWarning: torch.as_tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return (torch.as_tensor(tuple(obj), dtype=torch.uint8), )
0%| | 0/30 [00:00<?, ?it/s]/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:197: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bool(tensors[start].item()), start + 1
/home/sd/.local/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py:878: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if dim % default_overall_up_factor != 0:
/home/sd/.local/lib/python3.10/site-packages/diffusers/models/resnet.py:265: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/home/sd/.local/lib/python3.10/site-packages/diffusers/models/resnet.py:271: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/home/sd/.local/lib/python3.10/site-packages/diffusers/models/resnet.py:173: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/home/sd/.local/lib/python3.10/site-packages/diffusers/models/resnet.py:186: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if hidden_states.shape[0] >= 64:
/usr/bin/ld: skipping incompatible /lib/i386-linux-gnu/libcuda.so when searching for -lcuda
/usr/bin/ld: skipping incompatible /lib/i386-linux-gnu/libcuda.so when searching for -lcuda
/usr/bin/ld: cannot find -lcuda: No such file or directory
/usr/bin/ld: skipping incompatible /lib/i386-linux-gnu/libcuda.so when searching for -lcuda
/usr/bin/ld: skipping incompatible /lib/i386-linux-gnu/libcuda.so when searching for -lcuda
collect2: error: ld returned 1 exit status
0%| | 0/30 [00:03<?, ?it/s]
Traceback (most recent call last):
File "/home/sd/Playground/stable-fast/examples/optimize_stable_diffusion_pipeline.py", line 150, in
main()
File "/home/sd/Playground/stable-fast/examples/optimize_stable_diffusion_pipeline.py", line 132, in main
model(**get_kwarg_inputs())
File "/home/sd/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 918, in call
noise_pred = self.unet(
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 29, in dynamic_graphed_callable
cached_callable = simple_make_graphed_callable(
File "/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 46, in simple_make_graphed_callable
return make_graphed_callable(callable,
File "/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 75, in make_graphed_callable
callable(*tree_copy(example_inputs),
File "/home/sd/.local/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 62, in wrapper
return traced_module(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 119, in forward
outputs = self.module(*self.convert_inputs(args, kwargs))
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):

graph(%input, %num_groups, %weight, %bias, %eps, %cudnn_enabled):
%y : Tensor = sfast_triton::group_norm_silu(%input, %num_groups, %weight, %bias, %eps)
~~~~~~~~~~~~ <--- HERE
return (%y)
RuntimeError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpy9k09g46/main.c', '-O3', '-I/home/sd/.local/lib/python3.10/site-packages/triton/common/../third_party/cuda/include', '-I/usr/include/python3.10', '-I/tmp/tmpy9k09g46', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpy9k09g46/group_norm_4d_channels_last_forward_collect_stats_kernel.cpython-310-x86_64-linux-gnu.so', '-L/lib/x86_64-linux-gnu', '-L/lib/i386-linux-gnu', '-L/lib/i386-linux-gnu']' returned non-zero exit status 1.

At:
/usr/lib/python3.10/subprocess.py(369): check_call
/home/sd/.local/lib/python3.10/site-packages/triton/common/build.py(90): _build
/home/sd/.local/lib/python3.10/site-packages/triton/compiler/make_launcher.py(39): make_stub
/home/sd/.local/lib/python3.10/site-packages/triton/compiler/compiler.py(425): compile
<string>(63): group_norm_4d_channels_last_forward_collect_stats_kernel
/home/sd/.local/lib/python3.10/site-packages/sfast/triton/__init__.py(35): new_func
/home/sd/.local/lib/python3.10/site-packages/triton/runtime/autotuner.py(232): run
/home/sd/.local/lib/python3.10/site-packages/triton/runtime/autotuner.py(232): run
/home/sd/.local/lib/python3.10/site-packages/sfast/triton/ops/group_norm.py(425): group_norm_forward
/home/sd/.local/lib/python3.10/site-packages/sfast/triton/torch_ops.py(188): forward
/home/sd/.local/lib/python3.10/site-packages/torch/autograd/function.py(539): apply
/home/sd/.local/lib/python3.10/site-packages/sfast/triton/torch_ops.py(226): group_norm_silu
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1527): _call_impl
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/home/sd/.local/lib/python3.10/site-packages/sfast/jit/trace_helper.py(119): forward
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1527): _call_impl
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/home/sd/.local/lib/python3.10/site-packages/sfast/jit/trace_helper.py(62): wrapper
/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py(75): make_graphed_callable
/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py(46): simple_make_graphed_callable
/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py(29): dynamic_graphed_callable
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1527): _call_impl
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/home/sd/.local/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py(918): __call__
/home/sd/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py(115): decorate_context
/home/sd/Playground/stable-fast/examples/optimize_stable_diffusion_pipeline.py(132): main
/home/sd/Playground/stable-fast/examples/optimize_stable_diffusion_pipeline.py(150): <module>

Can anyone at least guide me as to what could be wrong? Thanks!
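The `/usr/bin/ld: cannot find -lcuda` lines above suggest the linker only sees a 32-bit libcuda stub when Triton builds its kernel launcher. One hedged workaround (paths assume a standard Ubuntu driver install and may differ on your machine) is to extend gcc's library search path before the Triton build is triggered:

```python
# Hypothetical workaround: gcc honors LIBRARY_PATH, so point it at a 64-bit
# libcuda.so before stable-fast triggers Triton's JIT build.
import os

os.environ["LIBRARY_PATH"] = ":".join([
    "/usr/lib/x86_64-linux-gnu",     # usual 64-bit driver library location
    "/usr/local/cuda/lib64/stubs",   # toolkit stub as a fallback
    os.environ.get("LIBRARY_PATH", ""),
])
```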

RuntimeError for Demo Test

Hi,

Thanks for your work. I am running the demo test but hit this issue when running inference after a successful compile:

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):

graph(%1, %2):
    %x : Tensor = sfast_triton::reshape(%1, %2)
                  ~~~~~~~~~~~~ <--- HERE
    return (%x)
RuntimeError: RuntimeError: shape '[512, 512, 64, 64]' is invalid for input of size 2097152

At:
  /opt/sd/lib/python3.10/site-packages/torch/_ops.py(502): __call__
  /opt/sd/lib/python3.10/site-packages/sfast/triton/torch_ops.py(82): forward
  /opt/sd/lib/python3.10/site-packages/torch/autograd/function.py(506): apply
  /opt/sd/lib/python3.10/site-packages/sfast/triton/torch_ops.py(97): reshape
  /opt/sd/lib/python3.10/site-packages/torch/nn/modules/module.py(1501): _call_impl
  /opt/sd/lib/python3.10/site-packages/sfast/jit/trace_helper.py(111): forward
  /opt/sd/lib/python3.10/site-packages/torch/nn/modules/module.py(1501): _call_impl
  /opt/sd/lib/python3.10/site-packages/sfast/jit/trace_helper.py(57): wrapper
  /opt/sd/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py(707): __call__
  /opt/sd/lib/python3.10/site-packages/torch/utils/_contextlib.py(115): decorate_context
  /hostroot/experiments/sfast/test_sfast.py(70): <module>

Would you help to have a look? Thanks!

Running Speed is Slower for SDXL Model

Hi, another issue I found is that it's not accelerating SDXL. Running the demo on an A100, the speed of SDXL with the compiled model is 5.3 it/s, but with plain diffusers it's 8.8 it/s. The compiled model with stable-fast is slower.

Fails to compile model in Docker, missing installs?

Hello, I was able to get this repo working with SDXL Turbo on my 4090 using a venv, but when I dockerize the build it repeatedly fails on a missing tmp file. I am wondering if you have seen this issue before and whether I am perhaps missing some apt or pip installs in my Docker image. Appreciate your help @chengzeyi.

Dockerfile:
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get -y update \
    && apt-get install -y --no-install-recommends python3.10 python-is-python3 git libgl1 libsndfile1 pip ffmpeg google-perftools \
        libvulkan1 libnvidia-gl-525-server mesa-vulkan-drivers gcc build-essential \
    && apt-get autoremove -y \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --upgrade pip setuptools wheel --no-cache-dir

WORKDIR testing
COPY requirements2.txt requirements2.txt
RUN pip install -r requirements2.txt --no-cache-dir

COPY . .

CMD ["python3", "examples/optimize_stable_diffusion_pipeline.py"]

Error logs:

INFO:root:Tracing forward
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:159: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
obj_type = tensors[start].item()
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:218: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:228: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:214: TracerWarning: Converting a tensor to a Python list might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bytes(tensors[start].tolist()), start + 1
/usr/local/lib/python3.10/dist-packages/transformers/modeling_attn_mask_utils.py:66: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if input_shape[-1] > 1 or self.sliding_window is not None:
/usr/local/lib/python3.10/dist-packages/transformers/modeling_attn_mask_utils.py:137: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if past_key_values_length > 0:
/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py:273: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py:281: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py:313: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:23: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return torch.tensor([num], dtype=torch.int64)
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:253: TracerWarning: torch.Tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return super().new(cls, x, *args, **kwargs)
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:123: TracerWarning: torch.as_tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return (torch.as_tensor(tuple(obj), dtype=torch.uint8), )
0%| | 0/30 [00:00<?, ?it/s]INFO:root:Dynamically graphing forward
/usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py:88: UserWarning: The CUDA Graph is empty. This usually means that the graph was attempted to be captured on wrong device or stream. (Triggered internally at ../aten/src/ATen/cuda/CUDAGraph.cpp:192.)
super().capture_end()
INFO:root:Tracing forward
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:197: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bool(tensors[start].item()), start + 1
/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_condition.py:878: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if dim % default_overall_up_factor != 0:
/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py:265: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py:271: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py:173: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py:186: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if hidden_states.shape[0] >= 64:
/tmp/tmpuumtrxyy/main.c:4:10: fatal error: Python.h: No such file or directory
4 | #include <Python.h>
| ^~~~~~~~~~
compilation terminated.
0%| | 0/30 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/testing/examples/optimize_stable_diffusion_pipeline.py", line 81, in
output_image = model(**kwarg_inputs).images[0]
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 918, in call
noise_pred = self.unet(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py", line 29, in dynamic_graphed_callable
cached_callable = simple_make_graphed_callable(
File "/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py", line 46, in simple_make_graphed_callable
return make_graphed_callable(callable,
File "/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py", line 75, in make_graphed_callable
callable(*tree_copy(example_inputs),
File "/usr/local/lib/python3.10/dist-packages/sfast/jit/trace_helper.py", line 55, in wrapper
return traced_module(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sfast/jit/trace_helper.py", line 112, in forward
outputs = self.module(*self.convert_inputs(*args, **kwargs))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):

graph(%input, %num_groups, %weight, %bias, %eps, %cudnn_enabled):
%y : Tensor = sfast_triton::group_norm_silu(%input, %num_groups, %weight, %bias, %eps)
~~~~~~~~~~~~ <--- HERE
return (%y)
RuntimeError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpuumtrxyy/main.c', '-O3', '-I/usr/local/lib/python3.10/dist-packages/triton/common/../third_party/cuda/include', '-I/usr/include/python3.10', '-I/tmp/tmpuumtrxyy', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpuumtrxyy/group_norm_4d_channels_last_forward_collect_stats_kernel.cpython-310-x86_64-linux-gnu.so', '-L/usr/lib/x86_64-linux-gnu']' returned non-zero exit status 1.

At:
/usr/lib/python3.10/subprocess.py(369): check_call
/usr/local/lib/python3.10/dist-packages/triton/common/build.py(90): _build
/usr/local/lib/python3.10/dist-packages/triton/compiler/make_launcher.py(39): make_stub
/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py(425): compile
<string>(63): group_norm_4d_channels_last_forward_collect_stats_kernel
/usr/local/lib/python3.10/dist-packages/sfast/triton/__init__.py(35): new_func
/usr/local/lib/python3.10/dist-packages/triton/runtime/autotuner.py(232): run
/usr/local/lib/python3.10/dist-packages/triton/runtime/autotuner.py(232): run
/usr/local/lib/python3.10/dist-packages/sfast/triton/ops/group_norm.py(437): group_norm_forward
/usr/local/lib/python3.10/dist-packages/sfast/triton/torch_ops.py(186): forward
/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py(539): apply
/usr/local/lib/python3.10/dist-packages/sfast/triton/torch_ops.py(224): group_norm_silu
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/usr/local/lib/python3.10/dist-packages/sfast/jit/trace_helper.py(112): forward
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/usr/local/lib/python3.10/dist-packages/sfast/jit/trace_helper.py(55): wrapper
/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py(75): make_graphed_callable
/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py(46): simple_make_graphed_callable
/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py(29): dynamic_graphed_callable
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py(918): __call__
/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py(115): decorate_context
/testing/examples/optimize_stable_diffusion_pipeline.py(81): <module>
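
For context on the failure above: gcc exits because Python.h is missing (the command line passes -I/usr/include/python3.10, so the headers are expected there), and Triton needs that header to compile its CPython launcher stub. A quick diagnostic sketch, assuming missing CPython development headers are the root cause:

import os
import sysconfig

# If this prints False, the CPython development headers that provide
# Python.h (the python3-dev / python3.10-dev package on Debian/Ubuntu,
# an assumption about this environment) are not installed, and Triton's
# gcc step cannot succeed.
include_dir = sysconfig.get_paths()["include"]
print(include_dir, os.path.exists(os.path.join(include_dir, "Python.h")))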

Pip Freeze:
accelerate==0.25.0
annotated-types==0.6.0
anyio==3.7.1
certifi==2023.11.17
charset-normalizer==3.3.2
click==8.1.7
diffusers==0.24.0
exceptiongroup==1.2.0
fastapi==0.104.1
filelock==3.13.1
fsspec==2023.12.0
h11==0.14.0
huggingface-hub==0.19.4
idna==3.6
importlib-metadata==7.0.0
Jinja2==3.1.2
MarkupSafe==2.1.3
mpmath==1.3.0
networkx==3.2.1
numpy==1.26.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
packaging==23.2
Pillow==10.1.0
psutil==5.9.6
pydantic==2.5.2
pydantic_core==2.14.5
PyYAML==6.0.1
regex==2023.10.3
requests==2.31.0
safetensors==0.4.1
sniffio==1.3.0
stable-fast @ https://github.com/chengzeyi/stable-fast/releases/download/v0.0.12.post6/stable_fast-0.0.12.post6+torch210cu121-cp310-cp310-manylinux2014_x86_64.whl
starlette==0.27.0
sympy==1.12
tokenizers==0.15.0
torch==2.1.0
tqdm==4.66.1
transformers==4.35.2
triton==2.1.0
typing_extensions==4.8.0
urllib3==2.1.0
uvicorn==0.24.0.post1
xformers==0.0.22.post7
zipp==3.17.0

Does xformers still matter?

Hi, I was wondering: since you have done extensive benchmarking, you probably have details on this. It seems you have built-in support for xformers in there, but I thought PyTorch 2.0 introduced an equivalent built-in mechanism (torch.nn.functional.scaled_dot_product_attention). Should I still care about xformers? Cheers
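
For reference, a minimal sketch of the toggle in question, assuming the CompilationConfig flags documented in the stable-fast README (`model` being a loaded diffusers pipeline):

from sfast.compilers.stable_diffusion_pipeline_compiler import (
    CompilationConfig,
    compile,
)

config = CompilationConfig.Default()
# With PyTorch >= 2.0, torch.nn.functional.scaled_dot_product_attention
# covers much the same ground as xformers' memory-efficient attention,
# so enable_xformers mainly matters if xformers is installed and is
# measurably faster on your GPU for your workload.
config.enable_xformers = False
config.enable_triton = True
config.enable_cuda_graph = True
compiled_model = compile(model, config)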

How to input fixed latents to the model during inference

Hi, I would like to know how to input fixed latents to the model during inference, because I need fixed latents to avoid random noise when comparing the results of stable-fast and plain PyTorch. I have set the input parameters to the following format, but it doesn't work.

import numpy as np
import torch

latents = np.load('latents.npy', mmap_mode=None, allow_pickle=False,
                  fix_imports=True, encoding='ASCII')
latents = torch.from_numpy(latents).half().cuda()

kwarg_inputs = dict(
    prompt=prompt,
    latents=latents,
    height=512,
    width=512,
    num_inference_steps=20,
    num_images_per_prompt=1,
)
output_image = compiled_model(**kwarg_inputs).images[0]
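
One way to get reproducible latents without round-tripping through NumPy is to sample them directly on the GPU with a seeded generator. A minimal sketch, assuming a standard SD v1-style pipeline (4 latent channels, VAE downsample factor 8) and the `compiled_model`/`prompt` names from the snippet above; the dtype and device must match what the pipeline runs in:

import torch

# Seed a CUDA generator and draw the fixed noise on the GPU in fp16,
# matching the pipeline's device and dtype.
generator = torch.Generator(device="cuda").manual_seed(42)
latents = torch.randn(
    (1, 4, 512 // 8, 512 // 8),  # (batch, channels, height // 8, width // 8)
    generator=generator,
    device="cuda",
    dtype=torch.float16,
)
output_image = compiled_model(prompt=prompt, latents=latents, height=512,
                              width=512, num_inference_steps=20,
                              num_images_per_prompt=1).images[0]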

RuntimeError: _Map_base::at

Hi, I was trying this out for maximum optimization on an AWS G5 instance (it's just an NVIDIA A10G) running Ubuntu. I was using ComfyUI by calling the nodes directly in Python code, and I kept getting this error message that I couldn't solve. How can I resolve it?
All the dependencies ('diffusers>=0.19.3', 'xformers>=0.0.20', 'triton>=2.1.0', 'torch>=1.12.0') were met, and it works on my desktop but not on Ubuntu.

/opt/conda/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:157: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
obj_type = tensors[start].item()
/opt/conda/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:216: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/opt/conda/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:226: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/opt/conda/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:212: TracerWarning: Converting a tensor to a Python list might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bytes(tensors[start].tolist()), start + 1
/opt/conda/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:203: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return int(tensors[start].item()), start + 1
0%| | 0/12 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/ubuntu/test/workflow_clip_sdxl2.py", line 320, in
main()
File "/home/ubuntu/test/workflow_clip_sdxl2.py", line 229, in main
ksampler_3 = ksampler.sample(
File "/home/ubuntu/ComfyUI/nodes.py", line 1286, in sample
return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
File "/home/ubuntu/ComfyUI/nodes.py", line 1256, in common_ksampler
samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
File "/home/ubuntu/ComfyUI/custom_nodes/ComfyUI-Impact-Pack/modules/impact/sample_error_enhancer.py", line 22, in informative_sample
raise e
File "/home/ubuntu/ComfyUI/custom_nodes/ComfyUI-Impact-Pack/modules/impact/sample_error_enhancer.py", line 9, in informative_sample
return original_sample(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/sample.py", line 100, in sample
samples = sampler.sample(noise, positive_copy, negative_copy, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 711, in sample
return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 617, in sample
samples = sampler.sample(model_wrap, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 556, in sample
samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/k_diffusion/sampling.py", line 137, in sample_euler
denoised = model(x, sigma_hat * s_in, **extra_args)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 277, in forward
out = self.inner_model(x, sigma, cond=cond, uncond=uncond, cond_scale=cond_scale, model_options=model_options, seed=seed)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 267, in forward
return self.apply_model(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 264, in apply_model
out = sampling_function(self.inner_model, x, timestep, uncond, cond, cond_scale, model_options=model_options, seed=seed)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 252, in sampling_function
cond, uncond = calc_cond_uncond_batch(model, cond, uncond, x, timestep, model_options)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 228, in calc_cond_uncond_batch
output = model_options['model_function_wrapper'](model.apply_model, {"input": input_x, "timestep": timestep_, "c": c, "cond_or_uncond": cond_or_uncond}).chunk(batch_chunks)
File "/home/ubuntu/ComfyUI/custom_nodes/ComfyUI_stable_fast/node.py", line 69, in call
return self.stable_fast_model.get_traced_module(input_x, timestep_, **c)[0](
File "/home/ubuntu/ComfyUI/custom_nodes/ComfyUI_stable_fast/module/stable_diffusion_pipeline_compiler.py", line 62, in get_traced_module
traced_m, call_helper = trace_with_kwargs(
File "/opt/conda/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 23, in trace_with_kwargs
traced_module = better_trace(TraceablePosArgOnlyModuleWrapper(func),
File "/opt/conda/lib/python3.10/site-packages/sfast/jit/utils.py", line 29, in better_trace
script_module = torch.jit.trace(func, *args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/jit/_trace.py", line 798, in trace
return trace_module(
File "/opt/conda/lib/python3.10/site-packages/torch/jit/_trace.py", line 1065, in trace_module
module._c._create_method_from_trace(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1508, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 127, in forward
outputs = self.module(*orig_args, **orig_kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1508, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 77, in forward
return self.func(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/model_base.py", line 68, in apply_model
model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1508, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/ldm/modules/diffusionmodules/openaimodel.py", line 619, in forward
h = forward_timestep_embed(module, h, emb, context, transformer_options)
File "/home/ubuntu/ComfyUI/comfy/ldm/modules/diffusionmodules/openaimodel.py", line 35, in forward_timestep_embed
x = layer(x, emb)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1508, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/ldm/modules/diffusionmodules/openaimodel.py", line 210, in forward
return checkpoint(
File "/home/ubuntu/ComfyUI/comfy/ldm/modules/diffusionmodules/util.py", line 121, in checkpoint
return CheckpointFunction.apply(func, len(inputs), *args)
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
RuntimeError: _Map_base::at

Potential improvements to stable-fast

Hi chengzeyi,
Wanted to first off congratulate you on this awesome work! I have actually also been working on a similar project here, but I recently stopped development since your project has already been widely adopted. However, there are some features I have been working on that I believe could enhance stable-fast:

  • Using torch.fx to rewrite and accelerate the UNet (see the sketch after this list)
  • Supporting tensor parallelism for people with more than 1 gpu
  • INT4 quantization with GPTQ
  • Sparse inference

I would love to work with you on the above topics if possible, since I have already partially implemented quite a few of them! Please let me know if you could see us collaborating in the future.
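
On the torch.fx point, here is a minimal sketch of the kind of graph rewrite meant (a hypothetical pass, not stable-fast's actual implementation): trace a module and fuse GroupNorm + SiLU pairs, the pattern stable-fast already accelerates with its Triton group_norm_silu op. A real pass would dispatch to a fused Triton kernel instead of the eager composition used here.

import torch
import torch.fx as fx


class FusedGroupNormSiLU(torch.nn.Module):
    """Stand-in for a fused kernel (e.g. a Triton group_norm + SiLU op)."""

    def __init__(self, gn: torch.nn.GroupNorm):
        super().__init__()
        self.gn = gn

    def forward(self, x):
        return torch.nn.functional.silu(self.gn(x))


def fuse_groupnorm_silu(model: torch.nn.Module) -> fx.GraphModule:
    gm = fx.symbolic_trace(model)
    modules = dict(gm.named_modules())
    for node in list(gm.graph.nodes):
        # Look for a SiLU module fed directly by a GroupNorm module.
        if node.op != "call_module" or not isinstance(modules[node.target], torch.nn.SiLU):
            continue
        if len(node.args) != 1 or not isinstance(node.args[0], fx.Node):
            continue
        prev = node.args[0]
        if prev.op == "call_module" and isinstance(modules[prev.target], torch.nn.GroupNorm):
            # Swap the GroupNorm submodule for the fused version ...
            parent_name, _, attr = prev.target.rpartition(".")
            parent = gm.get_submodule(parent_name) if parent_name else gm
            setattr(parent, attr, FusedGroupNormSiLU(modules[prev.target]))
            # ... and drop the now-redundant SiLU node from the graph.
            node.replace_all_uses_with(prev)
            gm.graph.erase_node(node)
    gm.graph.lint()
    gm.recompile()
    return gm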

Failing to compile

Using Linux 22.04, Torch 2., diffusers 0.22.1, xformers 0.0.22.post7, triton 2.1.
I've posted the entire build log in the hope that it helps isolate what I'm doing wrong.

Using pip 23.3.1 from /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/pip (python 3.10)
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting stable-fast
  Cloning https://github.com/chengzeyi/stable-fast.git (to revision main) to /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273
  Running command git version
  git version 2.34.1
  Running command git clone --filter=blob:none https://github.com/chengzeyi/stable-fast.git /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273
  Cloning into '/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273'...
  Running command git show-ref main
  52afadfe0f49b2aa13e9ac15566cedb4e0732784 refs/heads/main
  52afadfe0f49b2aa13e9ac15566cedb4e0732784 refs/remotes/origin/main
  Running command git symbolic-ref -q HEAD
  refs/heads/main
  Resolved https://github.com/chengzeyi/stable-fast.git to commit 52afadfe0f49b2aa13e9ac15566cedb4e0732784
  Running command git rev-parse HEAD
  52afadfe0f49b2aa13e9ac15566cedb4e0732784
  Running command python setup.py egg_info
  running egg_info
  creating /tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info
  writing /tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/PKG-INFO
  writing dependency_links to /tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/dependency_links.txt
  writing requirements to /tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/requires.txt
  writing top-level names to /tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/top_level.txt
  writing manifest file '/tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/SOURCES.txt'
  reading manifest file '/tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/SOURCES.txt'
  adding license file 'LICENSE'
  writing manifest file '/tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/SOURCES.txt'
  Preparing metadata (setup.py) ... done
Requirement already satisfied: packaging in ./venv/lib/python3.10/site-packages (from stable-fast) (23.2)
Requirement already satisfied: torch>=1.12.0 in ./venv/lib/python3.10/site-packages (from stable-fast) (2.1.0)
Requirement already satisfied: filelock in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (3.13.1)
Requirement already satisfied: typing-extensions in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (4.8.0)
Requirement already satisfied: sympy in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (1.12)
Requirement already satisfied: networkx in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (3.2.1)
Requirement already satisfied: jinja2 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (3.1.2)
Requirement already satisfied: fsspec in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (2023.10.0)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.105)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.105)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.105)
Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (8.9.2.26)
Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.3.1)
Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (11.0.2.54)
Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (10.3.2.106)
Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (11.4.5.107)
Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.0.106)
Requirement already satisfied: nvidia-nccl-cu12==2.18.1 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (2.18.1)
Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.105)
Requirement already satisfied: triton==2.1.0 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (2.1.0)
Requirement already satisfied: nvidia-nvjitlink-cu12 in ./venv/lib/python3.10/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch>=1.12.0->stable-fast) (12.3.52)
Requirement already satisfied: MarkupSafe>=2.0 in ./venv/lib/python3.10/site-packages (from jinja2->torch>=1.12.0->stable-fast) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in ./venv/lib/python3.10/site-packages (from sympy->torch>=1.12.0->stable-fast) (1.3.0)
Building wheels for collected packages: stable-fast
  Running command python setup.py bdist_wheel
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-cpython-310
  creating build/lib.linux-x86_64-cpython-310/sfast
  copying sfast/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast
  creating build/lib.linux-x86_64-cpython-310/sfast/jit
  copying sfast/jit/trace_helper.py -> build/lib.linux-x86_64-cpython-310/sfast/jit
  copying sfast/jit/utils.py -> build/lib.linux-x86_64-cpython-310/sfast/jit
  copying sfast/jit/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/jit
  creating build/lib.linux-x86_64-cpython-310/sfast/compilers
  copying sfast/compilers/stable_diffusion_pipeline_compiler.py -> build/lib.linux-x86_64-cpython-310/sfast/compilers
  copying sfast/compilers/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/compilers
  creating build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/aot_printer.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/patch.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/copy_func.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/custom_python_operator.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/torch_dispatch.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/flat_tensors.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/gpu_device.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/memory_format.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/xformers_attention.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/env.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  copying sfast/utils/compute_precision.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
  creating build/lib.linux-x86_64-cpython-310/sfast/triton
  copying sfast/triton/torch_ops.py -> build/lib.linux-x86_64-cpython-310/sfast/triton
  copying sfast/triton/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/triton
  creating build/lib.linux-x86_64-cpython-310/sfast/dynamo
  copying sfast/dynamo/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/dynamo
  creating build/lib.linux-x86_64-cpython-310/sfast/cuda
  copying sfast/cuda/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/cuda
  copying sfast/cuda/graphs.py -> build/lib.linux-x86_64-cpython-310/sfast/cuda
  creating build/lib.linux-x86_64-cpython-310/sfast/jit/passes
  copying sfast/jit/passes/triton_passes.py -> build/lib.linux-x86_64-cpython-310/sfast/jit/passes
  copying sfast/jit/passes/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/jit/passes
  creating build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
  copying sfast/utils/term_image/image_to_ansi.py -> build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
  copying sfast/utils/term_image/imgcat.py -> build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
  copying sfast/utils/term_image/kdtree.py -> build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
  copying sfast/utils/term_image/climage.py -> build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
  copying sfast/utils/term_image/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
  creating build/lib.linux-x86_64-cpython-310/sfast/triton/modules
  copying sfast/triton/modules/patch.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/modules
  copying sfast/triton/modules/native.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/modules
  copying sfast/triton/modules/diffusers.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/modules
  copying sfast/triton/modules/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/modules
  creating build/lib.linux-x86_64-cpython-310/sfast/triton/ops
  copying sfast/triton/ops/group_norm.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/ops
  copying sfast/triton/ops/copy.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/ops
  copying sfast/triton/ops/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/ops
  copying sfast/triton/ops/activation.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/ops
  copying sfast/triton/ops/conv.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/ops
  creating build/lib.linux-x86_64-cpython-310/sfast/dynamo/backends
  copying sfast/dynamo/backends/sfast_jit.py -> build/lib.linux-x86_64-cpython-310/sfast/dynamo/backends
  copying sfast/dynamo/backends/registry.py -> build/lib.linux-x86_64-cpython-310/sfast/dynamo/backends
  copying sfast/dynamo/backends/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/dynamo/backends
  running build_ext
  /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py:414: UserWarning: The detected CUDA version (12.3) has a minor version mismatch with the version that was used to compile PyTorch (12.1). Most likely this shouldn't be a problem.
    warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
  /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py:424: UserWarning: There are no x86_64-linux-gnu-g++ version bounds defined for CUDA version 12.3
    warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
  building 'sfast._C' extension
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas
  creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn
  Emitting ninja build file /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  [1/13] /usr/local/cuda/bin/nvcc  -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --extended-lambda -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -std=c++17
  FAILED: /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o
  /usr/local/cuda/bin/nvcc  -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --extended-lambda -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -std=c++17
  In file included from /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu:7:
  /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/ATen/cudnn/cudnn-wrapper.h:3:10: fatal error: cudnn.h: No such file or directory
      3 | #include <cudnn.h>
        |          ^~~~~~~~~
  compilation terminated.
  [2/13] /usr/local/cuda/bin/nvcc  -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/CUDABlas.cu -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/CUDABlas.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --extended-lambda -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -std=c++17
  /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/CUDABlas.cu(291): warning #177-D: variable "falpha" was declared but never referenced
      float falpha = alpha;
            ^

  Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

  /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/CUDABlas.cu(292): warning #177-D: variable "fbeta" was declared but never referenced
      float fbeta = beta;
            ^

  [3/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/misc.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/misc.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/misc.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [4/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/scalar_tensor_erase.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/scalar_tensor_erase.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/scalar_tensor_erase.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [5/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/fused_linear.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/fused_linear.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/fused_linear.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [6/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/op_input_tensor_conversion.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/op_input_tensor_conversion.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/op_input_tensor_conversion.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [7/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/device_constant_override.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/device_constant_override.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/device_constant_override.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [8/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/compilation_unit.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/compilation_unit.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/compilation_unit.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [9/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/main.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/main.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/main.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [10/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/python_operator.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/python_operator.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/python_operator.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  In file included from /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/utils/python_arg_parser.h:65,
                   from /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/jit/python/pybind_utils.h:26,
                   from /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/python_operator.cpp:10:
  /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/utils/python_strings.h:104:19: warning: ‘pybind11::object PyObject_FastGetAttrString(PyObject*, const char*)’ defined but not used [-Wunused-function]
    104 | static py::object PyObject_FastGetAttrString(PyObject* obj, const char* name) {
        |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~
  [11/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [12/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/init.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/init.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/init.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [13/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/cublas_gemm.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/cublas_gemm.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/cublas_gemm.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/cublas_gemm.cpp:641:55: warning: "/*" within comment [-Wcomment]
    641 |               beta_, self.scalar_type(), c10::nullopt /* layout */
        |
  /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/cublas_gemm.cpp:642:28: warning: "/*" within comment [-Wcomment]
    642 | /*, at::kCPU, c10::nullopt /* pin_memory */ /*));
        |
  ninja: build stopped: subcommand failed.
  Traceback (most recent call last):
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
      subprocess.run(
    File "/usr/lib/python3.10/subprocess.py", line 526, in run
      raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/setup.py", line 109, in <module>
      setup(
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/__init__.py", line 103, in setup
      return distutils.core.setup(**attrs)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
      return run_commands(dist)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
      dist.run_commands()
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
      self.run_command(cmd)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/dist.py", line 989, in run_command
      super().run_command(command)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/wheel/bdist_wheel.py", line 369, in run
      self.run_command("build")
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/dist.py", line 989, in run_command
      super().run_command(command)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/command/build.py", line 131, in run
      self.run_command(cmd_name)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/dist.py", line 989, in run_command
      super().run_command(command)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 88, in run
      _build_ext.run(self)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
      self.build_extensions()
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 873, in build_extensions
      build_ext.build_extensions(self)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
      self._build_extensions_serial()
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
      self.build_extension(ext)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 249, in build_extension
      _build_ext.build_extension(self, ext)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/Cython/Distutils/build_ext.py", line 135, in build_extension
      super(build_ext, self).build_extension(ext)
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
      objects = self.compiler.compile(
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 686, in unix_wrap_ninja_compile
      _write_ninja_file_and_compile_objects(
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1774, in _write_ninja_file_and_compile_objects
      _run_ninja_build(
    File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
      raise RuntimeError(message) from e
  RuntimeError: Error compiling objects for extension
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/bin/python3 -u -c '
  exec(compile('"'"''"'"''"'"'
  # This is <pip-setuptools-caller> -- a caller that pip uses to run setup.py
  #
  # - It imports setuptools before invoking setup.py, to enable projects that directly
  #   import from `distutils.core` to work with newer packaging standards.
  # - It provides a clear error message when setuptools is not installed.
  # - It sets `sys.argv[0]` to the underlying `setup.py`, when invoking `setup.py` so
  #   setuptools doesn'"'"'t think the script is `-c`. This avoids the following warning:
  #     manifest_maker: standard file '"'"'-c'"'"' not found".
  # - It generates a shim setup.py, for handling setup.cfg-only projects.
  import os, sys, tokenize
  
  try:
      import setuptools
  except ImportError as error:
      print(
          "ERROR: Can not execute `setup.py` since setuptools is not available in "
          "the build environment.",
          file=sys.stderr,
      )
      sys.exit(1)
  
  __file__ = %r
  sys.argv[0] = __file__
  
  if os.path.exists(__file__):
      filename = __file__
      with tokenize.open(__file__) as f:
          setup_py_code = f.read()
  else:
      filename = "<auto-generated setuptools caller>"
      setup_py_code = "from setuptools import setup; setup()"
  
  exec(compile(setup_py_code, filename, "exec"))
  '"'"''"'"''"'"' % ('"'"'/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/setup.py'"'"',), "<pip-setuptools-caller>", "exec"))' bdist_wheel -d /tmp/pip-wheel-_zs0b_wo
  cwd: /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/
  Building wheel for stable-fast (setup.py) ... error
  ERROR: Failed building wheel for stable-fast
  Running setup.py clean for stable-fast
  Running command python setup.py clean
  running clean
  removing 'build/temp.linux-x86_64-cpython-310' (and everything under it)
  removing 'build/lib.linux-x86_64-cpython-310' (and everything under it)
  'build/bdist.linux-x86_64' does not exist -- can't clean it
  'build/scripts-3.10' does not exist -- can't clean it
  removing 'build'
Failed to build stable-fast
ERROR: Could not build wheels for stable-fast, which is required to install pyproject.toml-based projects

Is this package compatible with Lycoris LoRAs?

Hello, I see in the README that stable-fast supports LoRAs out of the box. Does anyone know if it's possible to use Lycoris LoRAs with this package? The Nvidia TensorRT extension for A1111 seems to have questionable support for them.

SDNext Support

Hi @chengzeyi, I sent you an email last week but wanted to try again: we'd love to get in touch, since Vlad has now applied your code to SDNext and we're seeing great results on our dev branch.
Hit us up on our Discord server or just reply to my email so we can connect, as we plan on posting our results.

Thanks!

RuntimeError: no valid convolution algorithms available in CuDNN

The following problem occurred when I called compile(pipe):


Traceback (most recent call last):
  File "svd_sf.py", line 49, in <module>
    frames = pipe(image, decode_chunk_size=7, num_frames=20).frames[0]
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/diffusers/pipelines/stable_video_diffusion/pipeline_stable_video_diffusion.py", line 499, in __call__
    noise_pred = self.unet(
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/cuda/graphs.py", line 40, in dynamic_graphed_callable
    cached_callable = simple_make_graphed_callable(
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/cuda/graphs.py", line 61, in simple_make_graphed_callable
    return make_graphed_callable(func,
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/cuda/graphs.py", line 90, in make_graphed_callable
    func(*tree_copy(example_inputs, detach=True),
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/jit/trace_helper.py", line 64, in wrapper
    return traced_module(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/jit/trace_helper.py", line 133, in forward
    outputs = self.module(*self.convert_inputs(args, kwargs))
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
graph(%1, %2, %3, %4, %5, %6, %7, %8, %9, %10, %11, %12, %13, %14, %15):
  %x = sfast::cudnn_convolution_bias_add(%1, %2, %3, %14, %15, %4, %5, %6, %7, %8, %9)
       ~~~~~ <--- HERE
return (%x)
RuntimeError: no valid convolution algorithms available in CuDNN


I have cuDNN installed on my server; both torch.backends.cudnn.is_available() and torch.backends.cudnn.enabled return True.


Update: I successfully ran the example in README.md. Currently I'm trying to accelerate Stable Video Diffusion, which involves very large matmuls, so possibly that is the reason?
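
A hedged first check, offered as an assumption rather than a confirmed diagnosis: "no valid convolution algorithms available in CuDNN" is often a memory symptom, raised when cuDNN cannot allocate workspace for any candidate algorithm, and SVD's activations are large enough to make that plausible. Shrinking the working set is a cheap way to test it:

  # If this smaller configuration succeeds, the original failure was likely
  # memory pressure during cuDNN's algorithm search, not a missing algorithm.
  frames = pipe(image, decode_chunk_size=2, num_frames=14).frames[0]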

'torch' module not found during 'stable-fast' installation

Environment Information:

  • Python Version: 3.10.13
  • Virtual Environment: venv (confirmed)
  • torch Version: 2.1.0 (installed in the venv)
  • pip Version: 23.3.1
  • OS: Linux

Problem Description:
When trying to install the stable-fast package, I encounter a ModuleNotFoundError: No module named 'torch'. I have confirmed that torch is installed in the virtual environment using pip list. Also, no package conflicts were detected with pip check.

Attempted Solutions:

  • Reconfirmed the installation of torch in the virtual environment.
  • Tested in a different virtual environment.

Additional Information:
The same error occurs when installing directly from the GitHub repository and from PyPI.

I would appreciate any suggestions or solutions to this problem.
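
One plausible cause, offered as an assumption rather than a confirmed diagnosis: stable-fast's setup.py imports torch at build time, while pip builds source packages in an isolated environment that cannot see the venv's site-packages, so the import fails even though pip list shows torch. If that is what is happening, disabling build isolation lets setup.py find the torch that is already installed:

  pip install stable-fast --no-build-isolation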

Breakdown of individual optimizations

First of all, congrats on the release of such an amazing project! Reading through the list of individual optimizations / 'features', I am very curious how much each of them contributes to the total performance increase, and whether there is a reasonably small subset that yields 85-90% of the overall speedup while greatly reducing the surface area. While developing stable-fast, have any numbers been collected between new optimization passes, even unconfirmed ones?
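
In the absence of official numbers, a rough ablation harness can give a first-order answer: compile the same pipeline several times, enabling one CompilationConfig flag at a time, and time a steady-state generation. The sketch below assumes the README-style import path and a hypothetical load_pipeline() helper that returns a fresh pipeline on CUDA; the flag names are the ones CompilationConfig exposes elsewhere in these issues.

  import time
  import torch
  from sfast.compilers.diffusion_pipeline_compiler import compile, CompilationConfig

  def steady_state_time(pipe, **kwargs):
      pipe(**kwargs)  # warm-up call triggers tracing / CUDA graph capture
      torch.cuda.synchronize()
      start = time.time()
      pipe(**kwargs)
      torch.cuda.synchronize()
      return time.time() - start

  kwargs = dict(prompt='an astronaut riding a horse', num_inference_steps=30)
  for flag in ['enable_xformers', 'enable_triton', 'enable_cuda_graph']:
      config = CompilationConfig.Default()
      setattr(config, flag, True)  # toggle exactly one optimization per run
      pipe = compile(load_pipeline(), config)  # load_pipeline() is hypothetical
      print(flag, steady_state_time(pipe, **kwargs))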

Black image with relelases post1-post3 and nightly

Releases after 0.0.12 (0.0.12.post1 through 0.0.12.post3 and the nightly) generate black images with the same dependencies as 0.0.12.
Using the NV CUDA 12.1 container as base:
requirements.txt
diffusers==0.23.1
xformers
torch==2.1.0
nvidia-pytriton==0.4.1
numpy==1.26.2
triton==2.1.0
requests
transformers==4.35.2
tokenizers==0.15.0
https://github.com/chengzeyi/stable-fast/releases/download/v0.0.12/stable_fast-0.0.12_torch210_cu121-cp310-cp310-manylinux2014_x86_64.whl
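
A hedged way to narrow the regression down (a generic debugging step, not a known root cause): black outputs are often NaNs appearing somewhere in the fp16 path, and checking the latents before the VAE decode shows which half of the pipeline to blame:

  import torch

  # output_type='latent' stops before the VAE decode: NaNs here implicate the
  # UNet path, while clean latents plus a black final image implicate the VAE.
  latents = pipe('a cat', num_inference_steps=30, output_type='latent').images
  print('NaNs in latents:', torch.isnan(latents).any().item())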

Found an unsupported argument type in the JIT tracer: at::Generator.

Hi, I'm running SD 2.1 and got the following error. How can I fix it?

  File "/zhangjun/git/stable-fast/src/sfast/jit/trace_helper.py", line 78, in forward
    return self.func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/diffusers/schedulers/scheduling_euler_ancestral_discrete.py", line 356, in step
    noise = randn_tensor(model_output.shape, dtype=model_output.dtype, device=device, generator=generator)
  File "/opt/conda/lib/python3.10/site-packages/diffusers/utils/torch_utils.py", line 80, in randn_tensor
    latents = torch.randn(shape, generator=generator, device=rand_device, dtype=dtype, layout=layout).to(device)
RuntimeError: Found an unsupported argument type in the JIT tracer: at::Generator. File a bug report.

Config options are as follows:

  config = CompilationConfig.Default()
  try:
      import xformers
      config.enable_xformers = True
  except ImportError:
      print('xformers not installed, skip')
  try:
      import triton
      config.enable_triton = True
      torch.backends.cuda.matmul.allow_tf32 = True
  except ImportError:
      print('Triton not installed, skip')
  config.enable_cuda_graph = True
  config.trace_scheduler = True

Inference code:

scheduler = EulerAncestralDiscreteScheduler.from_pretrained(
    model_path, subfolder="scheduler"
)
pipeline = StableDiffusionPipeline.from_pretrained(
    model_path,
    scheduler=scheduler,
    torch_dtype=self.dtype,
    safety_checker=None,
)

# compile 
...
pipeline = compile(pipeline, config)

sfast_inputs = dict(
    prompt=prompt,
    negative_prompt=neg_prompt,
    generator=torch.Generator(device='cuda').manual_seed(seed),
    width=width,
    height=height,
    num_inference_steps=num_inference_steps,
)
image = pipeline(**sfast_inputs).images

Environment Information:

  • diffusers==0.23.1
  • transformers==4.35.2
  • peft==0.6.2
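
One hedged reading of this trace: the failure happens inside the scheduler's step(), and the scheduler is only traced because the config above sets trace_scheduler = True; an at::Generator argument cannot cross the JIT tracing boundary. If that reading is correct, keeping the scheduler in eager mode should sidestep the error at a small speed cost:

  # Assumption: leave the scheduler untraced so the torch.Generator passed for
  # deterministic seeding never reaches the JIT tracer.
  config.trace_scheduler = False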

Stable Video Optimizations?

Do you think any of these optimizations could be applied to Stable Video Diffusion? I'd like to help here if possible.

Optimizing Controlnets independently?

First, awesome work. The speed improvement is great, and the additional cold start time is very manageable.

Second, I load many pipelines, one per model, and do things like swap out controlnets and VAEs in response to user requests. When I swap controlnets and run a generation, I get the error RuntimeError: mat1 and mat2 must have the same dtype, but got Half and Float. I'm guessing this is because the replacement controlnet hasn't had the same optimizations applied as the original. Is there a way to compile just the controlnets so that they can be swapped in and out of the pipeline dynamically?
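
Not an authoritative answer, but the Half/Float message itself suggests the replacement controlnet still holds float32 weights while the compiled pipeline runs in float16. Matching device and dtype before the swap should clear that particular error; whether a controlnet can be compiled standalone and hot-swapped is a separate question for the maintainer. A minimal sketch, assuming pipe is the compiled pipeline and new_controlnet is the freshly loaded replacement:

  import torch

  # Match the replacement controlnet to the pipeline's device and dtype so
  # fp16 activations from the compiled UNet never meet fp32 weights.
  new_controlnet = new_controlnet.to(device=pipe.device, dtype=torch.float16)
  pipe.controlnet = new_controlnet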
