chengzeyi / stable-fast
Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
License: MIT License
2023-12-07 17:35:10.128 [stderr ] File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2023-12-07 17:35:10.128 [stderr ] return func(*args, **kwargs)
2023-12-07 17:35:10.128 [stderr ] File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py", line 926, in __call__
2023-12-07 17:35:10.128 [stderr ] image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[
2023-12-07 17:35:10.128 [stderr ] File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 36, in dynamic_graphed_callable
2023-12-07 17:35:10.128 [stderr ] cached_callable = simple_make_graphed_callable(
2023-12-07 17:35:10.128 [stderr ] File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 53, in simple_make_graphed_callable
2023-12-07 17:35:10.128 [stderr ] return make_graphed_callable(callable,
2023-12-07 17:35:10.128 [stderr ] File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 98, in make_graphed_callable
2023-12-07 17:35:10.129 [stderr ] static_inputs = shadow_copy(static_inputs_)
2023-12-07 17:35:10.129 [stderr ] File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/utils/copy.py", line 49, in shadow_copy
2023-12-07 17:35:10.129 [stderr ] return type(obj)(shadow_copy(x, detach=detach) for x in obj)
2023-12-07 17:35:10.129 [stderr ] File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/utils/copy.py", line 49, in <genexpr>
2023-12-07 17:35:10.129 [stderr ] return type(obj)(shadow_copy(x, detach=detach) for x in obj)
2023-12-07 17:35:10.129 [stderr ] File "/root/.cache/isolate/virtualenv/cb0c4d3222905a6bb1bceaa9f8e4dae878a144d056a139f4dbce875ede43363e/lib/python3.10/site-packages/sfast/utils/copy.py", line 45, in shadow_copy
2023-12-07 17:35:10.129 [stderr ] return sfast._C._create_shadow_tensor(
2023-12-07 17:35:10.129 [stderr ] RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.
When doing deterministic generation by passing a torch.Generator to the pipeline, stable-fast raises an AssertionError:
Traceback (most recent call last):
File "/data/dbgsfast/dbgsfast/__init__.py", line 43, in <module>
model(**kwarg_inputs, generator=torch.Generator(device="cuda").manual_seed(41))
File "/data/dbgsfast/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/dbgsfast/.venv/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 958, in __call__
image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[
File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 32, in dynamic_graphed_callable
return cached_callable(*args, **kwargs)
File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 143, in functionalized
return _graphed_module(*user_args, **user_kwarg_args)
File "/data/dbgsfast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/dbgsfast/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 130, in forward
outputs = self._forward(*inputs, **kwarg_inputs)
File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 136, in _forward
tree_copy_(static_kwarg_inputs, kwarg_inputs)
File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/utils/copy.py", line 20, in tree_copy_
tree_copy_(dest[k], src[k])
File "/data/dbgsfast/.venv/lib/python3.10/site-packages/sfast/utils/copy.py", line 22, in tree_copy_
assert dest == src
code to reproduce
import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler
from sfast.compilers.stable_diffusion_pipeline_compiler import (
    compile,
    CompilationConfig,
)
def load_model():
    model = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )
    model.scheduler = EulerAncestralDiscreteScheduler.from_config(
        model.scheduler.config
    )
    model.safety_checker = None
    model.to(torch.device("cuda"))
    return model
model = load_model()
config = CompilationConfig.Default()
config.enable_cuda_graph = True
model = compile(model, config)
kwarg_inputs = dict(
    prompt="(masterpiece:1,2), best quality, masterpiece, best detail face, a beautiful girl",
    num_inference_steps=30,
    num_images_per_prompt=1,
    height=512,
    width=512,
)
model(**kwarg_inputs, generator=torch.Generator(device="cuda").manual_seed(42))
model(**kwarg_inputs, generator=torch.Generator(device="cuda").manual_seed(41))
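For anyone reading the traceback: the failing line is `assert dest == src` inside `tree_copy_`, reached through the `generator` kwarg. Below is a minimal pure-Python sketch of why a second generator object could trip it. Everything here is a simplified reconstruction from the two traceback lines, not the actual sfast code, and `FakeGenerator` is a hypothetical stand-in that assumes torch.Generator compares by identity (no custom `__eq__`).

```python
class FakeGenerator:
    """Hypothetical stand-in for torch.Generator: no __eq__, so == means identity."""

    def __init__(self, seed):
        self.seed = seed


def tree_copy_(dest, src):
    """Simplified guess at the helper, based only on the traceback lines."""
    if isinstance(dest, dict):
        for k in dest:
            tree_copy_(dest[k], src[k])
    else:
        # Non-container leaves must be "equal" to what was seen at capture time.
        assert dest == src


# kwargs remembered from the first (graph-capturing) call
captured = {"generator": FakeGenerator(42)}
shared = captured["generator"]


def second_call_fails():
    """A fresh generator object is a different object, so the assert fires."""
    try:
        tree_copy_(captured, {"generator": FakeGenerator(41)})
    except AssertionError:
        return True
    return False


def reseed_same_object_passes():
    """Re-seeding the same object keeps identity, so the assert holds."""
    shared.seed = 41
    tree_copy_(captured, {"generator": shared})
    return True
```

If that reading is right, creating one generator up front and calling `manual_seed` on it before each run might avoid the assertion, but that is untested against stable-fast itself.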
Hey!
I have access to A100s, and would like to submit the results from A100. Let me know if you have some sort of benchmark script up and ready so that I can use it to run and get the numbers.
Hi! I saw your message about your model. What is your model based on? Do you have a version compatible with ComfyUI or Draw Things on Mac? Can it run on Mac MPS?
Any ideas? I've found barely any mention of what freezing is or does.
It didn't do this with the runwayml/stable-diffusion-v1-5 model in the example.
Is the .ckpt file format required?
```
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing safety_checker=None. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at huggingface/diffusers#254 .
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/diffusers/configuration_utils.py:134: FutureWarning: Accessing config attribute requires_safety_checker directly via 'StableDiffusionPipeline' object attribute is deprecated. Please access 'requires_safety_checker' over 'StableDiffusionPipeline's config object instead, e.g. 'scheduler.config.requires_safety_checker'.
deprecate("direct config name access", "1.0.0", deprecation_message, standard_warn=False)
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py:750: FutureWarning: torch_dtype is deprecated and will be removed in version 0.25.0.
deprecate("torch_dtype", "0.25.0", "")
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:157: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
obj_type = tensors[start].item()
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:216: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:226: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:212: TracerWarning: Converting a tensor to a Python list might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bytes(tensors[start].tolist()), start + 1
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:66: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if input_shape[-1] > 1 or self.sliding_window is not None:
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:137: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if past_key_values_length > 0:
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:273: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:281: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:313: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:21: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return torch.tensor([num], dtype=torch.int64)
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:251: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return torch.tensor([obj_id], dtype=torch.int64)
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:121: TracerWarning: torch.as_tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return (torch.as_tensor(tuple(obj), dtype=torch.uint8), )
Traceback (most recent call last):
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/stable-fast-exp.py", line 73, in <module>
output_image = compiled_model(**kwarg_inputs).images[0]
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 756, in __call__
prompt_embeds, negative_prompt_embeds = self.encode_prompt(
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 352, in encode_prompt
prompt_embeds = self.text_encoder(text_input_ids.to(device), attention_mask=attention_mask)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 51, in wrapper
traced_m = ts_compiler(traced_m, call_helper, args,
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/compilers/stable_diffusion_pipeline_compiler.py", line 82, in ts_compiler
m = jit_utils.better_freeze(m)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/sfast/jit/utils.py", line 35, in better_freeze
freezed_module = torch.jit.freeze(script_module, *args, **kwargs)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/jit/_freeze.py", line 108, in freeze
raise RuntimeError(
RuntimeError: Freezing is currently only implemented for modules in eval mode. Please call .eval() on your module before freezing.
```
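The error message itself asks for `.eval()` before freezing, and the traceback shows the failure while compiling the text encoder. A minimal sketch of doing that for the usual pipeline components before calling `compile`; the helper name `set_eval_mode` is made up for illustration, and whether this resolves the problem for this particular checkpoint is not confirmed here.

```python
def set_eval_mode(pipe, component_names=("unet", "vae", "text_encoder")):
    """Call .eval() on each named component that exists; return the names touched.

    The default names are the standard StableDiffusionPipeline attributes.
    """
    touched = []
    for name in component_names:
        module = getattr(pipe, name, None)
        if module is not None and callable(getattr(module, "eval", None)):
            module.eval()  # torch.nn.Module.eval() switches off training mode
            touched.append(name)
    return touched
```

Usage would be something like `set_eval_mode(model)` right after loading the pipeline and before `compile(model, config)`.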
Thank you for this great work. It's amazing that it reaches almost the same performance as TensorRT on NVIDIA GPUs!
However, is there a convincing reason to use it instead of TensorRT?
If the environment (e.g. a Docker image) does not have torch installed yet, the setup fails.
Instead, torch should be imported lazily and the asserts performed only if torch is available.
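A rough sketch of what "import lazily, check only if available" could look like. The helper names `try_import` and `torch_build_info` are hypothetical and this is not the project's setup.py, just one way to structure it.

```python
import importlib


def try_import(name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def torch_build_info():
    """Collect torch facts for later checks, or None when torch is absent."""
    torch = try_import("torch")
    if torch is None:
        # No torch yet (e.g. a bare Docker layer): defer the checks instead of failing.
        return None
    return {"version": torch.__version__, "cuda": torch.version.cuda}
```

Any version or CUDA asserts would then run only when `torch_build_info()` returns something other than `None`.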
I think torch.compile should work on HF diffusers: https://huggingface.co/docs/diffusers/optimization/torch2.0#a100-batch-size-1
I've been trying to get it to work on Windows, but I'm currently stuck at this import error of some DLL file it seems. Does anyone know how to fix this issue, or some tips to try and get it working on Windows?
Traceback (most recent call last):
File "C:\Users\USER\ComfyUI_windows_portable\ComfyUI\nodes.py", line 1800, in load_custom_node
module_spec.loader.exec_module(module)
File "<frozen importlib._bootstrap_external>", line 940, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "C:\Users\USER\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_stable_fast\__init__.py", line 1, in <module>
from .node import ApplyStableFastUnet
File "C:\Users\USER\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_stable_fast\node.py", line 2, in <module>
from sfast.compilers.stable_diffusion_pipeline_compiler import CompilationConfig
File "C:\Users\USER\ComfyUI_windows_portable\python_embeded\Lib\site-packages\sfast\__init__.py", line 23, in <module>
import sfast._C as _C
ImportError: DLL load failed while importing _C: The specified module could not be found.
Cannot import C:\Users\USER\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_stable_fast module for custom nodes: DLL load failed while importing _C: The specified module could not be found.
If we're going for raw speed autoencoder_tiny should present a massive speedup. https://huggingface.co/docs/diffusers/api/models/autoencoder_tiny
I'm not sure exactly how this works but it seems like the issue is here:
Commenting this line out allows my models to compile with TAE and provides approximately a 50ms speedup on my RTX 3090
Did not work until I added this flag: `--disable-cuda-malloc`. Was failing in the compilation step with some CUDA runtime error. I have 16GB VRAM (4060Ti). Around 12% speedup with SDXL 768x768 in WSL2 :)
Originally posted by @Soumyajit in gameltb/ComfyUI_stable_fast#1 (comment)
Trying to compile on my machine results in:
FAILED: /home/dirkson/ai/depend/stable-fast/build/temp.linux-x86_64-cpython-311/home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/dirkson/ai/depend/stable-fast/build/temp.linux-x86_64-cpython-311/home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o.d -DWITH_CUDA -I/home/dirkson/ai/depend/stable-fast/sfast/csrc -I/home/dirkson/ai/lib/python3.11/site-packages/torch/include -I/home/dirkson/ai/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/home/dirkson/ai/lib/python3.11/site-packages/torch/include/TH -I/home/dirkson/ai/lib/python3.11/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/dirkson/ai/include -I/usr/include/python3.11 -c -c /home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu -o /home/dirkson/ai/depend/stable-fast/build/temp.linux-x86_64-cpython-311/home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --extended-lambda -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -std=c++17
/home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu(1210): error: no suitable user-defined conversion from "at::IntArrayRef" to "c10::SymIntArrayRef" exists
input_r, weight_r, bias_opt, stride_, fromIntArrayRefUnchecked(padding_),
^
/home/dirkson/ai/depend/stable-fast/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu(1211): error: no suitable user-defined conversion from "at::IntArrayRef" to "c10::SymIntArrayRef" exists
dilation_, transposed_, fromIntArrayRefUnchecked(output_padding_),
My stack is fairly up-to-date, more so than standard, I think. Python 3.11.5 with pytorch nightlies. nvcc/cuda versioning:
/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
Let me know if there's other versioning that might help! My knowledge of C++ isn't good enough that I can really grok this error message, otherwise you'd be getting a PR rather than an issue. Sorry!
Cheers!
Hey
I am trying to swap the LoRA of the compiled model with the sample code given in the README. I get this error
When I try to replace the weights myself, I get very bad outputs.
My snippet for swapping the weights:
state_dict = pipe.unet.state_dict()
pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy_face")
update_state_dict(state_dict, pipe.unet.state_dict())
pipe.unet.load_state_dict(state_dict, assign=True)
Then infer.
100%|██████████| 50/50 [00:00<00:00, 58.00it/s]
58 iterations per second (with torch.cuda.synchronize).
Settings:
A100 40GB, fp16, batch size 1
height=512,
width=512,
num_inference_steps=50,
num_images_per_prompt=1,
Wall clock time for full pipeline: 862ms (no torch.cuda.synchronize), 927ms (with torch.cuda.synchronize).
torch version: torch==2.2.0.dev20231203+cu121
Also, please update wallclock time and iteration/s for other GPUs on torch nightly. If extrapolation serves us right, it ought to be faster than TensorRT.
2023-12-04 03:24:33.549 [stderr ] 100%|██████████| 50/50 [00:00<00:00, 79.00it/s]
79 Iterations per second. (naive measurement)
Wall clock time for full pipeline: 862ms.
torch eager and not channels first: wallclock time: 1763ms
100%|██████████| 50/50 [00:01<00:00, 29.55it/s]
(naive measurement)
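Since the numbers above differ depending on whether torch.cuda.synchronize is called, here is a small sketch of a timing helper that makes the choice explicit. The function names are made up for illustration; `sync` is any zero-argument callable, so on a CUDA machine you would pass `torch.cuda.synchronize`.

```python
import time


def time_call(fn, sync=None, warmup=1, runs=3):
    """Return mean wall-clock seconds per call of fn().

    sync, if given, runs once before the clock starts and once before it
    stops, so queued asynchronous GPU work is included in the measurement.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / runs


def iterations_per_second(steps, seconds):
    """Convert a per-run wall-clock time into an it/s figure."""
    return steps / seconds
```

For example, `time_call(lambda: model(**kwarg_inputs), sync=torch.cuda.synchronize)` would time the whole pipeline. As a sanity check on the figures above, 50 steps at 58 it/s works out to roughly 0.86 s for the denoising loop.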
For a simple workflow like this:
If you run with one model and then switch to another, PyTorch fails here:
This is because the mempool in sfast.cuda.graphs._per_device_execution_envs references graphs that have already been released. Should we make it per model?
Originally posted by @gameltb in gameltb/ComfyUI_stable_fast#1 (comment)
Since the setup is very peculiar and computationally intensive (and also requires cudnn/cublas on the target machine), would it make sense to distribute/attach wheels with releases?
I used stable-fast==0.0.13.post3 to test an LCM LoRA, and the result looks like this:
But using the LCM LoRA in pure diffusers works fine.
My code is as follows:
import torch
from diffusers import LCMScheduler, AutoPipelineForText2Image, DiffusionPipeline
from sfast.compilers.stable_diffusion_pipeline_compiler import (
    compile, CompilationConfig)
import numpy as np
from PIL import Image
base_model_path = "runwayml/stable-diffusion-v1-5"
lcm_path = "latent-consistency/lcm-lora-sdv1-5"
def load_model():
    model = DiffusionPipeline.from_pretrained(base_model_path,
                                              torch_dtype=torch.float16,
                                              safety_checker=None,
                                              use_safetensors=True)
    model.scheduler = LCMScheduler.from_config(model.scheduler.config)
    model.safety_checker = None
    model.to(torch.device('cuda'))
    #model.unet.load_attn_procs(lcm_path)
    model.load_lora_weights(lcm_path)
    model.fuse_lora()
    return model
def compile_model(model):
    config = CompilationConfig.Default()
    # xformers and Triton are suggested for achieving best performance.
    # It might be slow for Triton to generate, compile and fine-tune kernels.
    try:
        import xformers
        config.enable_xformers = True
    except ImportError:
        print('xformers not installed, skip')
    # NOTE:
    # When GPU VRAM is insufficient or the architecture is too old, Triton might be slow.
    # Disable Triton if you encounter this problem.
    try:
        import triton
        config.enable_triton = True
    except ImportError:
        print('Triton not installed, skip')
    # NOTE:
    # CUDA Graph is suggested for small batch sizes and small resolutions to reduce CPU overhead.
    # My implementation can handle dynamic shape with increased need for GPU memory.
    # But when your GPU VRAM is insufficient or the image resolution is high,
    # CUDA Graph could cause less efficient VRAM utilization and slow down the inference,
    # especially when on Windows or WSL which has the "shared VRAM" mechanism.
    # If you meet problems related to it, you should disable it.
    config.enable_cuda_graph = True
    model = compile(model, config)
    return model
def main():
    prompt = "a rendering of a living room with a couch and a tv"
    negative_prompt = "ugly,logo,pixelated,lowres,text,word,cropped,low quality,normal quality,username,watermark,signature,blurry,soft,NSFW,painting,cartoon,hang,occluded objects,Fisheye View"
    model = load_model()
    model = compile_model(model)
    kwarg_inputs = dict(
        prompt=prompt,
        negative_prompt=negative_prompt,
        width=768,
        height=512,
        num_inference_steps=7,
        num_images_per_prompt=1,
        guidance_scale=1.5,
    )
    # NOTE: Warm it up.
    # The initial calls will trigger compilation and might be very slow.
    # After that, it should be very fast.
    for _ in range(3):
        output_image = model(**kwarg_inputs).images[0]
    # Let's see it!
    # Note: Progress bar might work incorrectly due to the async nature of CUDA.
    img_total = []
    for i in range(2):
        output_image = model(
            prompt=prompt,
            negative_prompt=negative_prompt,
            width=768,
            height=512,
            num_inference_steps=7,
            num_images_per_prompt=6,
            # generator=generators
        ).images
        img_row = []
        for img in output_image:
            img_row.append(np.asarray(img))
        img = np.hstack(img_row)
        img_total.append(img)
    image = np.vstack(img_total)
    # cv2.putText(image,prompt,(40,50),cv2.FONT_HERSHEY_SIMPLEX,2,(0,0,255),3)
    image = Image.fromarray(image)
    image.save("./output_lcm.png")
if __name__ == '__main__':
    main()
So I was debugging why the deployed model inference was slower than in a Jupyter Notebook.
The reason was that the CI that is building the container image did not build the cuda extensions (because of this).
It would probably be better to invert this logic and throw an error if it cannot build the CUDA extensions.
Otherwise many users might not understand why their performance is worse than what should be possible.
This could be bypassed with an env var like OPTIONAL_CUDA=1 or DISABLE_CUDA=1.
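To make the proposal concrete, here is a sketch of the inverted logic: fail loudly by default and skip the CUDA build only on an explicit opt-out. `DISABLE_CUDA` is just the name suggested above, not an existing flag, and `should_build_cuda` is a hypothetical helper rather than anything in the current setup.py.

```python
import os


def should_build_cuda(cuda_available, environ=None):
    """Return True to build CUDA extensions, False only on explicit opt-out.

    Raises instead of silently producing a slower CPU-only build.
    """
    environ = os.environ if environ is None else environ
    if environ.get("DISABLE_CUDA") == "1":  # proposed opt-out variable
        return False
    if not cuda_available:
        raise RuntimeError(
            "CUDA extensions cannot be built in this environment. "
            "Set DISABLE_CUDA=1 to build without them."
        )
    return True
```

In a CI image without a GPU toolchain, the build would then stop with a clear message unless someone deliberately sets the variable.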
Thanks for the project ❤️ I made a colab. 🥳 I hope you like it. https://github.com/camenduru/stable-fast-colab
I cannot build xformers from source; it crashes with OOM errors.
Installing the stable-fast repository also results in errors when running setup.py.
Is there a workaround for this? It runs with torch 2.1 but fails with nightly versions.
Hi. Thanks for the great repo. Tested it on SD1.5, the speed increase is really impressive. However, there are problems with SDXL. I'm trying to run it in Google Colab. Here's what I'm doing:
!pip install -q diffusers transformers accelerate omegaconf ninja
!pip install -q https://download.pytorch.org/whl/cu118/torch-2.1.0%2Bcu118-cp310-cp310-linux_x86_64.whl
!pip install -q https://download.pytorch.org/whl/triton-2.1.0-0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
!pip install -q https://download.pytorch.org/whl/cu118/xformers-0.0.22.post4%2Bcu118-cp310-cp310-manylinux2014_x86_64.whl
!pip install -q https://github.com/camenduru/stable-fast/releases/download/colab/stable_fast-0.0.2-cp310-cp310-linux_x86_64.whl
Next, I initialize and compile the model:
from sfast.compilers.stable_diffusion_pipeline_compiler import (compile, CompilationConfig)
from diffusers import DiffusionPipeline
import torch, xformers, triton
torch.backends.cuda.matmul.allow_tf32 = True
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to("cuda")
config = CompilationConfig.Default()
config.enable_xformers = True
config.enable_triton = True
config.enable_cuda_graph = True
compiled_pipe = compile(pipe, config)
Finally, I try to run inference:
prompt = 'a car'
h, w = 1024, 1024
steps = 30
seed = 42
guidance_scale = 7
num_images = 1
image = pipe(
    prompt = prompt,
    height = h,
    width = w,
    num_inference_steps = steps,
    guidance_scale = guidance_scale,
    num_images_per_prompt = num_images,
    generator = torch.Generator(device='cuda').manual_seed(seed),
).images
I get the following error:
/usr/local/lib/python3.10/dist-packages/torch/jit/_trace.py in trace_module(mod, inputs, optimize, check_trace, check_inputs, check_tolerance, strict, _force_outplace, _module_class, _compilation_unit, example_inputs_is_kwarg, _store_inputs)
1063 else:
1064 example_inputs = make_tuple(example_inputs)
-> 1065 module._c._create_method_from_trace(
1066 method_name,
1067 func,
RuntimeError: Tracer cannot infer type of BaseModelOutputWithPooling(last_hidden_state=tensor([[[-0.3884, 0.0229, -0.0523, ..., -0.4902, -0.3066, 0.0674],
...
device='cuda:0', dtype=torch.float16)), attentions=None)
:Dictionary inputs to traced functions must have consistent type. Found Tensor and Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]
How does the SDXL inference speedup compare?
Traceback (most recent call last):
File "/workspace/webui-pakage/stable-fast-0.0.2/predict.py", line 3, in <module>
from sfast.compilers.stable_diffusion_pipeline_compiler import (compile,
File "/workspace/webui-pakage/stable-fast-0.0.2/sfast/__init__.py", line 22, in <module>
import sfast._C as _C
ModuleNotFoundError: No module named 'sfast._C'
How can I solve this?
env:
A100
xformers 0.0.22.post4+cu118
triton 2.1.0
torch 2.1.0+cu118
transformers 4.34.1
ninja 1.11.1.1
diffusers 0.21.4
stable-fast 0.0.2
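One thing that stands out in the traceback is that `sfast` is being imported from the unpacked source folder (`stable-fast-0.0.2/sfast/`), not from site-packages. A small diagnostic sketch for checking where a module resolves from and whether the compiled extension can be found; the `locate` helper is made up for illustration and this is only a way to narrow things down, not a confirmed fix.

```python
import importlib.util


def locate(module_name):
    """Return the file a module would load from, or None if it cannot be found.

    For dotted names the parent package is imported first, so a parent that
    fails to import also yields None.
    """
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        return None
    return None if spec is None else spec.origin
```

Running `print(locate("sfast"), locate("sfast._C"))` from the same working directory as `predict.py` would show whether the source tree is shadowing an installed wheel; if so, running the script from a different directory may be worth a try.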
Hi, I am trying to run python3 optimize_stable_diffusion_pipeline.py
and I get this nasty error where I can't really tell what exactly is wrong, as it refers to pretty much everything included here.
I am pretty sure I have the correct version of stable-fast, the one corresponding to my Python 3.10, CUDA 12.1 and torch 2.1; gcc is 11.4.
My system is an RTX 3090, i7 etc., with a quite clean install of Ubuntu. The NVIDIA drivers are 530.
This is my pip3 list:
pip3 list
Package Version
accelerate 0.25.0
antlr4-python3-runtime 4.9.3
apturl 0.5.2
bcrypt 3.2.0
blinker 1.4
Brlapi 0.8.3
certifi 2020.6.20
chardet 4.0.0
click 8.0.3
colorama 0.4.4
command-not-found 0.3
cryptography 3.4.8
cupshelpers 1.0
dbus-python 1.2.18
defer 1.0.6
diffusers 0.24.0
distro 1.7.0
distro-info 1.1+ubuntu0.1
duplicity 0.8.21
fasteners 0.14.1
filelock 3.13.1
fsspec 2023.12.1
future 0.18.2
httplib2 0.20.2
huggingface-hub 0.19.4
idna 3.3
importlib-metadata 4.6.4
jeepney 0.7.1
Jinja2 3.1.2
keyring 23.5.0
language-selector 0.1
launchpadlib 1.10.16
lazr.restfulclient 0.14.4
lazr.uri 1.0.6
lockfile 0.12.2
louis 3.20.0
macaroonbakery 1.3.1
Mako 1.1.3
MarkupSafe 2.0.1
monotonic 1.6
more-itertools 8.10.0
mpmath 1.3.0
netifaces 0.11.0
networkx 3.2.1
numpy 1.26.2
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
oauthlib 3.2.0
olefile 0.46
omegaconf 2.3.0
packaging 23.2
paramiko 2.9.3
pexpect 4.8.0
Pillow 9.0.1
pip 22.0.2
protobuf 3.12.4
psutil 5.9.6
ptyprocess 0.7.0
pycairo 1.20.1
pycups 2.0.1
PyGObject 3.42.1
PyJWT 2.3.0
pymacaroons 0.13.0
PyNaCl 1.5.0
pyparsing 2.4.7
PyQt5 5.15.10
PyQt5-Qt5 5.15.2
PyQt5-sip 12.13.0
pyRFC3339 1.1
python-apt 2.4.0+ubuntu2
python-dateutil 2.8.1
python-debian 0.1.43+ubuntu1.1
pytz 2022.1
pyxdg 0.27
PyYAML 5.4.1
regex 2023.10.3
reportlab 3.6.8
requests 2.25.1
safetensors 0.4.1
screen-resolution-extra 0.0.0
SecretStorage 3.3.1
setuptools 59.6.0
six 1.16.0
ssh-import-id 5.11
stable-fast 0.0.13.post3+torch210cu121
sympy 1.12
systemd-python 234
tokenizers 0.15.0
torch 2.1.0
torchvision 0.16.0
tqdm 4.66.1
transformers 4.35.2
triton 2.1.0
typing_extensions 4.8.0
ubuntu-advantage-tools 8001
ubuntu-drivers-common 0.0.0
ufw 0.36.1
unattended-upgrades 0.1
urllib3 1.26.5
usb-creator 0.3.7
wadllib 1.3.6
wheel 0.37.1
xdg 5
xformers 0.0.22.post7
xkit 0.0.0
zipp 1.0.0
And the error is below:
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00, 6.96it/s]
/home/sd/.local/lib/python3.10/site-packages/torch/cuda/graphs.py:88: UserWarning: The CUDA Graph is empty. This usually means that the graph was attempted to be captured on wrong device or stream. (Triggered internally at ../aten/src/ATen/cuda/CUDAGraph.cpp:192.)
super().capture_end()
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:159: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
obj_type = tensors[start].item()
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:218: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:228: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:214: TracerWarning: Converting a tensor to a Python list might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bytes(tensors[start].tolist()), start + 1
/home/sd/.local/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:66: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if input_shape[-1] > 1 or self.sliding_window is not None:
/home/sd/.local/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:137: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if past_key_values_length > 0:
/home/sd/.local/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:273: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/home/sd/.local/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:281: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
/home/sd/.local/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:313: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:23: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return torch.tensor([num], dtype=torch.int64)
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:253: TracerWarning: torch.Tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return super().new(cls, x, *args, **kwargs)
/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:123: TracerWarning: torch.as_tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return (torch.as_tensor(tuple(obj), dtype=torch.uint8), )
0%| | 0/30 [00:00<?, ?it/s]/home/sd/.local/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:197: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bool(tensors[start].item()), start + 1
/home/sd/.local/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py:878: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if dim % default_overall_up_factor != 0:
/home/sd/.local/lib/python3.10/site-packages/diffusers/models/resnet.py:265: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/home/sd/.local/lib/python3.10/site-packages/diffusers/models/resnet.py:271: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/home/sd/.local/lib/python3.10/site-packages/diffusers/models/resnet.py:173: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/home/sd/.local/lib/python3.10/site-packages/diffusers/models/resnet.py:186: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if hidden_states.shape[0] >= 64:
/usr/bin/ld: skipping incompatible /lib/i386-linux-gnu/libcuda.so when searching for -lcuda
/usr/bin/ld: skipping incompatible /lib/i386-linux-gnu/libcuda.so when searching for -lcuda
/usr/bin/ld: cannot find -lcuda: No such file or directory
/usr/bin/ld: skipping incompatible /lib/i386-linux-gnu/libcuda.so when searching for -lcuda
/usr/bin/ld: skipping incompatible /lib/i386-linux-gnu/libcuda.so when searching for -lcuda
collect2: error: ld returned 1 exit status
0%| | 0/30 [00:03<?, ?it/s]
Traceback (most recent call last):
File "/home/sd/Playground/stable-fast/examples/optimize_stable_diffusion_pipeline.py", line 150, in <module>
main()
File "/home/sd/Playground/stable-fast/examples/optimize_stable_diffusion_pipeline.py", line 132, in main
model(**get_kwarg_inputs())
File "/home/sd/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 918, in __call__
noise_pred = self.unet(
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 29, in dynamic_graphed_callable
cached_callable = simple_make_graphed_callable(
File "/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 46, in simple_make_graphed_callable
return make_graphed_callable(callable,
File "/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py", line 75, in make_graphed_callable
callable(*tree_copy(example_inputs),
File "/home/sd/.local/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 62, in wrapper
return traced_module(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 119, in forward
outputs = self.module(*self.convert_inputs(args, kwargs))
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
graph(%input, %num_groups, %weight, %bias, %eps, %cudnn_enabled):
%y : Tensor = sfast_triton::group_norm_silu(%input, %num_groups, %weight, %bias, %eps)
~~~~~~~~~~~~ <--- HERE
return (%y)
RuntimeError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpy9k09g46/main.c', '-O3', '-I/home/sd/.local/lib/python3.10/site-packages/triton/common/../third_party/cuda/include', '-I/usr/include/python3.10', '-I/tmp/tmpy9k09g46', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpy9k09g46/group_norm_4d_channels_last_forward_collect_stats_kernel.cpython-310-x86_64-linux-gnu.so', '-L/lib/x86_64-linux-gnu', '-L/lib/i386-linux-gnu', '-L/lib/i386-linux-gnu']' returned non-zero exit status 1.
At:
/usr/lib/python3.10/subprocess.py(369): check_call
/home/sd/.local/lib/python3.10/site-packages/triton/common/build.py(90): _build
/home/sd/.local/lib/python3.10/site-packages/triton/compiler/make_launcher.py(39): make_stub
/home/sd/.local/lib/python3.10/site-packages/triton/compiler/compiler.py(425): compile
(63): group_norm_4d_channels_last_forward_collect_stats_kernel
/home/sd/.local/lib/python3.10/site-packages/sfast/triton/__init__.py(35): new_func
/home/sd/.local/lib/python3.10/site-packages/triton/runtime/autotuner.py(232): run
/home/sd/.local/lib/python3.10/site-packages/triton/runtime/autotuner.py(232): run
/home/sd/.local/lib/python3.10/site-packages/sfast/triton/ops/group_norm.py(425): group_norm_forward
/home/sd/.local/lib/python3.10/site-packages/sfast/triton/torch_ops.py(188): forward
/home/sd/.local/lib/python3.10/site-packages/torch/autograd/function.py(539): apply
/home/sd/.local/lib/python3.10/site-packages/sfast/triton/torch_ops.py(226): group_norm_silu
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1527): _call_impl
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/home/sd/.local/lib/python3.10/site-packages/sfast/jit/trace_helper.py(119): forward
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1527): _call_impl
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/home/sd/.local/lib/python3.10/site-packages/sfast/jit/trace_helper.py(62): wrapper
/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py(75): make_graphed_callable
/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py(46): simple_make_graphed_callable
/home/sd/.local/lib/python3.10/site-packages/sfast/cuda/graphs.py(29): dynamic_graphed_callable
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1527): _call_impl
/home/sd/.local/lib/python3.10/site-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/home/sd/.local/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py(918): __call__
/home/sd/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py(115): decorate_context
/home/sd/Playground/stable-fast/examples/optimize_stable_diffusion_pipeline.py(132): main
/home/sd/Playground/stable-fast/examples/optimize_stable_diffusion_pipeline.py(150): <module>
Can anyone at least guide me on what could be wrong? Thanks.
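The linker messages above show `ld` skipping the 32-bit `/lib/i386-linux-gnu/libcuda.so` and then failing to find any usable match for `-lcuda`. One possible cause, which is an assumption and not confirmed in this thread, is that `/lib/x86_64-linux-gnu` contains only a versioned file such as `libcuda.so.1` and lacks the unversioned `libcuda.so` name that `-lcuda` resolves to. A small sketch for inspecting the directories passed with `-L` in the failing gcc command (the helper names `find_libcuda` and `has_linkable_libcuda` are hypothetical):

```python
import glob
import os


def find_libcuda(dirs):
    """Return {directory: sorted libcuda file names found in it}."""
    found = {}
    for d in dirs:
        matches = glob.glob(os.path.join(d, "libcuda.so*"))
        found[d] = sorted(os.path.basename(m) for m in matches)
    return found


def has_linkable_libcuda(names):
    """For shared linking, `-lcuda` looks for the unversioned `libcuda.so`."""
    return "libcuda.so" in names


if __name__ == "__main__":
    # Directories taken from the -L flags in the failing gcc command.
    dirs = ["/lib/x86_64-linux-gnu", "/lib/i386-linux-gnu"]
    for directory, names in find_libcuda(dirs).items():
        print(directory, names, "linkable:", has_linkable_libcuda(names))
```

If the 64-bit directory lists only `libcuda.so.1`, the driver installation would be the place to look next; whether that applies to this machine cannot be determined from the log alone.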
Hi,
Thanks for your work. I am running the demo test but having this issue when running the inference after successful compile:
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
graph(%1, %2):
%x : Tensor = sfast_triton::reshape(%1, %2)
~~~~~~~~~~~~ <--- HERE
return (%x)
RuntimeError: RuntimeError: shape '[512, 512, 64, 64]' is invalid for input of size 2097152
At:
/opt/sd/lib/python3.10/site-packages/torch/_ops.py(502): __call__
/opt/sd/lib/python3.10/site-packages/sfast/triton/torch_ops.py(82): forward
/opt/sd/lib/python3.10/site-packages/torch/autograd/function.py(506): apply
/opt/sd/lib/python3.10/site-packages/sfast/triton/torch_ops.py(97): reshape
/opt/sd/lib/python3.10/site-packages/torch/nn/modules/module.py(1501): _call_impl
/opt/sd/lib/python3.10/site-packages/sfast/jit/trace_helper.py(111): forward
/opt/sd/lib/python3.10/site-packages/torch/nn/modules/module.py(1501): _call_impl
/opt/sd/lib/python3.10/site-packages/sfast/jit/trace_helper.py(57): wrapper
/opt/sd/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py(707): __call__
/opt/sd/lib/python3.10/site-packages/torch/utils/_contextlib.py(115): decorate_context
/hostroot/experiments/sfast/test_sfast.py(70): <module>
Could you help take a look? Thanks!
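The numbers in the reshape error can be checked arithmetically: 2097152 equals 512 × 64 × 64, so the input holds a single 512-channel 64×64 feature map, while the target shape `[512, 512, 64, 64]` asks for 512 of them. One possible reading, not confirmed in this thread, is that a shape recorded at trace time is being applied to an input of a different size. A short sketch of the arithmetic (variable names are illustrative only):

```python
import math

target_shape = [512, 512, 64, 64]  # shape from the error message
input_numel = 2097152              # input size from the error message

# Number of elements the target shape would require.
target_numel = math.prod(target_shape)
print(target_numel)

# How many times larger the target is than the actual input.
print(target_numel // input_numel)

# A shape with the same trailing dimensions that does match the input size.
trailing = target_shape[1:]
matching_shape = [input_numel // math.prod(trailing)] + trailing
print(matching_shape)
```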
What would make it compatible with LCM models?
https://github.com/luosiallen/latent-consistency-model
Hi, another issue I found is that it's not accelerating SDXL. I'm running the demo on an A100: the compiled model reaches 5.3 it/s with SDXL, but plain diffusers reaches 8.8 it/s. The model compiled with stable-fast is slower.
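When comparing it/s between the compiled and uncompiled pipelines, it may help to exclude the first few calls, since the logs elsewhere on this page show tracing and Triton kernel compilation happening on early invocations. Below is a minimal timing sketch; `measure_its`, `warmup` and `iters` are hypothetical names, and `fn` stands in for one pipeline call. For GPU work, `fn` would also need to synchronize the device before returning, otherwise the timing only covers kernel launches.

```python
import time


def measure_its(fn, warmup=3, iters=10):
    """Return iterations per second of fn, ignoring `warmup` initial calls."""
    for _ in range(warmup):
        fn()  # early calls may include tracing or kernel compilation
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return iters / elapsed


if __name__ == "__main__":
    # Stand-in workload; replace with a call into the pipeline under test.
    print(f"{measure_its(lambda: sum(range(10000))):.1f} it/s")
```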
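Hello Zeyi,
The [email protected] address listed on your Git profile does not seem to accept email. Could we connect on WeChat to chat? I am also an algorithm engineer working on SD acceleration.
My email is [email protected].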
Hello, I was able to get this repo working with SDXL Turbo on my 4090 using a venv, but when I tried to dockerize the build it repeatedly fails on a missing tmp file. I am wondering whether you have seen this issue before, and whether my Docker image is missing some apt or pip packages that would cause it. I appreciate your help, @chengzeyi.
Dockerfile:
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get -y update \
    && apt-get install -y --no-install-recommends python3.10 python-is-python3 git libgl1 libsndfile1 pip ffmpeg google-perftools \
    libvulkan1 libnvidia-gl-525-server mesa-vulkan-drivers gcc build-essential \
    && apt-get autoremove -y \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip setuptools wheel --no-cache-dir
WORKDIR testing
COPY requirements2.txt requirements2.txt
RUN pip install -r requirements2.txt --no-cache-dir
COPY . .
CMD ["python3", "examples/optimize_stable_diffusion_pipeline.py"]
Error logs:
INFO:root:Tracing forward
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:159: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
obj_type = tensors[start].item()
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:218: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:228: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:214: TracerWarning: Converting a tensor to a Python list might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bytes(tensors[start].tolist()), start + 1
/usr/local/lib/python3.10/dist-packages/transformers/modeling_attn_mask_utils.py:66: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if input_shape[-1] > 1 or self.sliding_window is not None:
/usr/local/lib/python3.10/dist-packages/transformers/modeling_attn_mask_utils.py:137: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if past_key_values_length > 0:
/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py:273: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py:281: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py:313: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:23: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return torch.tensor([num], dtype=torch.int64)
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:253: TracerWarning: torch.Tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return super().new(cls, x, *args, **kwargs)
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:123: TracerWarning: torch.as_tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
return (torch.as_tensor(tuple(obj), dtype=torch.uint8), )
0%| | 0/30 [00:00<?, ?it/s]INFO:root:Dynamically graphing forward
/usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py:88: UserWarning: The CUDA Graph is empty. This usually means that the graph was attempted to be captured on wrong device or stream. (Triggered internally at ../aten/src/ATen/cuda/CUDAGraph.cpp:192.)
super().capture_end()
INFO:root:Tracing forward
/usr/local/lib/python3.10/dist-packages/sfast/utils/flat_tensors.py:197: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bool(tensors[start].item()), start + 1
/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_condition.py:878: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if dim % default_overall_up_factor != 0:
/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py:265: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py:271: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py:173: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py:186: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if hidden_states.shape[0] >= 64:
/tmp/tmpuumtrxyy/main.c:4:10: fatal error: Python.h: No such file or directory
4 | #include <Python.h>
| ^~~~~~~~~~
compilation terminated.
0%| | 0/30 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/testing/examples/optimize_stable_diffusion_pipeline.py", line 81, in <module>
output_image = model(**kwarg_inputs).images[0]
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 918, in __call__
noise_pred = self.unet(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py", line 29, in dynamic_graphed_callable
cached_callable = simple_make_graphed_callable(
File "/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py", line 46, in simple_make_graphed_callable
return make_graphed_callable(callable,
File "/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py", line 75, in make_graphed_callable
callable(*tree_copy(example_inputs),
File "/usr/local/lib/python3.10/dist-packages/sfast/jit/trace_helper.py", line 55, in wrapper
return traced_module(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sfast/jit/trace_helper.py", line 112, in forward
outputs = self.module(*self.convert_inputs(*args, **kwargs))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
graph(%input, %num_groups, %weight, %bias, %eps, %cudnn_enabled):
%y : Tensor = sfast_triton::group_norm_silu(%input, %num_groups, %weight, %bias, %eps)
~~~~~~~~~~~~ <--- HERE
return (%y)
RuntimeError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpuumtrxyy/main.c', '-O3', '-I/usr/local/lib/python3.10/dist-packages/triton/common/../third_party/cuda/include', '-I/usr/include/python3.10', '-I/tmp/tmpuumtrxyy', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpuumtrxyy/group_norm_4d_channels_last_forward_collect_stats_kernel.cpython-310-x86_64-linux-gnu.so', '-L/usr/lib/x86_64-linux-gnu']' returned non-zero exit status 1.
At:
/usr/lib/python3.10/subprocess.py(369): check_call
/usr/local/lib/python3.10/dist-packages/triton/common/build.py(90): _build
/usr/local/lib/python3.10/dist-packages/triton/compiler/make_launcher.py(39): make_stub
/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py(425): compile
(63): group_norm_4d_channels_last_forward_collect_stats_kernel
/usr/local/lib/python3.10/dist-packages/sfast/triton/__init__.py(35): new_func
/usr/local/lib/python3.10/dist-packages/triton/runtime/autotuner.py(232): run
/usr/local/lib/python3.10/dist-packages/triton/runtime/autotuner.py(232): run
/usr/local/lib/python3.10/dist-packages/sfast/triton/ops/group_norm.py(437): group_norm_forward
/usr/local/lib/python3.10/dist-packages/sfast/triton/torch_ops.py(186): forward
/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py(539): apply
/usr/local/lib/python3.10/dist-packages/sfast/triton/torch_ops.py(224): group_norm_silu
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/usr/local/lib/python3.10/dist-packages/sfast/jit/trace_helper.py(112): forward
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/usr/local/lib/python3.10/dist-packages/sfast/jit/trace_helper.py(55): wrapper
/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py(75): make_graphed_callable
/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py(46): simple_make_graphed_callable
/usr/local/lib/python3.10/dist-packages/sfast/cuda/graphs.py(29): dynamic_graphed_callable
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl
/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py(918): __call__
/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py(115): decorate_context
/testing/examples/optimize_stable_diffusion_pipeline.py(81): <module>
Pip Freeze:
accelerate==0.25.0
annotated-types==0.6.0
anyio==3.7.1
certifi==2023.11.17
charset-normalizer==3.3.2
click==8.1.7
diffusers==0.24.0
exceptiongroup==1.2.0
fastapi==0.104.1
filelock==3.13.1
fsspec==2023.12.0
h11==0.14.0
huggingface-hub==0.19.4
idna==3.6
importlib-metadata==7.0.0
Jinja2==3.1.2
MarkupSafe==2.1.3
mpmath==1.3.0
networkx==3.2.1
numpy==1.26.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
packaging==23.2
Pillow==10.1.0
psutil==5.9.6
pydantic==2.5.2
pydantic_core==2.14.5
PyYAML==6.0.1
regex==2023.10.3
requests==2.31.0
safetensors==0.4.1
sniffio==1.3.0
stable-fast @ https://github.com/chengzeyi/stable-fast/releases/download/v0.0.12.post6/stable_fast-0.0.12.post6+torch210cu121-cp310-cp310-manylinux2014_x86_64.whl
starlette==0.27.0
sympy==1.12
tokenizers==0.15.0
torch==2.1.0
tqdm==4.66.1
transformers==4.35.2
triton==2.1.0
typing_extensions==4.8.0
urllib3==2.1.0
uvicorn==0.24.0.post1
xformers==0.0.22.post7
zipp==3.17.0
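The `fatal error: Python.h: No such file or directory` line in the log above indicates that gcc could not find the CPython development headers while compiling the launcher stub. The Dockerfile installs `python3.10` but no headers package; on Ubuntu these headers usually come from `python3-dev` (or `python3.10-dev`), which is likely the missing piece here, although that is not confirmed in this thread. A quick check to run inside the container (the helper name `python_header_path` is hypothetical):

```python
import os
import sysconfig


def python_header_path():
    """Path where this interpreter's build configuration expects Python.h."""
    include_dir = sysconfig.get_config_var("INCLUDEPY")
    if not include_dir:
        # Fall back to the install-scheme include directory.
        include_dir = sysconfig.get_paths()["include"]
    return os.path.join(include_dir, "Python.h")


if __name__ == "__main__":
    path = python_header_path()
    print(path, "exists" if os.path.exists(path) else "MISSING")
```

If the file is reported as missing, installing the headers package in the image and rebuilding would be the first thing to try.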
Hi, since you have done extensive benchmarking, you probably have details on this. It seems that you have built-in support for xformers, but I thought PyTorch 2.0 introduced an equivalent mechanism of its own. Should I still care about xformers? Cheers
Hi, I would like to know how to pass fixed latents to the model during inference. I need fixed latents to remove random noise when comparing the results of stable-fast and PyTorch. I set the input parameters as follows, but it doesn't work.
import numpy as np
import torch

# `prompt` and `compiled_model` are defined earlier in the script.
# Load the saved latents and move them to the GPU in half precision.
latents = np.load('latents.npy')
latents = torch.from_numpy(latents).half().cuda()
kwarg_inputs = dict(
    prompt=prompt,
    latents=latents,
    height=512,
    width=512,
    num_inference_steps=20,
    num_images_per_prompt=1,
)
output_image = compiled_model(**kwarg_inputs).images[0]
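One thing worth checking when passing `latents` explicitly is their shape. For Stable Diffusion 1.x style pipelines in Diffusers, the latents are expected to have shape `(batch, channels, height // 8, width // 8)`. The channel count of 4 and the downscale factor of 8 used below are the usual defaults and are assumed here, not read from the model config; the helper name `expected_latent_shape` is hypothetical.

```python
def expected_latent_shape(batch_size, height, width, channels=4, vae_scale_factor=8):
    """Latent shape for a given image size, assuming SD 1.x style defaults."""
    return (
        batch_size,
        channels,
        height // vae_scale_factor,
        width // vae_scale_factor,
    )


# One prompt, one image per prompt, 512x512 output.
print(expected_latent_shape(1, 512, 512))  # (1, 4, 64, 64)
```

With the arguments in the post (`height=512`, `width=512`, `num_images_per_prompt=1` and a single prompt), the array loaded from `latents.npy` would need to match this shape.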
Hi, I was trying this out for maximum optimization on an AWS G5 instance running Ubuntu (it's just an NVIDIA A10G). I was using ComfyUI by calling the nodes directly from Python code, and I kept getting this error message that I couldn't solve.
How can I resolve this?
All the dependencies ('diffusers>=0.19.3' 'xformers>=0.0.20' 'triton>=2.1.0' 'torch>=1.12.0') were met. It works on my desktop but not on Ubuntu.
/opt/conda/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:157: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
obj_type = tensors[start].item()
/opt/conda/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:216: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/opt/conda/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:226: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
size = tensors[start].item()
/opt/conda/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:212: TracerWarning: Converting a tensor to a Python list might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return bytes(tensors[start].tolist()), start + 1
/opt/conda/lib/python3.10/site-packages/sfast/utils/flat_tensors.py:203: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return int(tensors[start].item()), start + 1
0%| | 0/12 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/ubuntu/test/workflow_clip_sdxl2.py", line 320, in <module>
main()
File "/home/ubuntu/test/workflow_clip_sdxl2.py", line 229, in main
ksampler_3 = ksampler.sample(
File "/home/ubuntu/ComfyUI/nodes.py", line 1286, in sample
return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
File "/home/ubuntu/ComfyUI/nodes.py", line 1256, in common_ksampler
samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
File "/home/ubuntu/ComfyUI/custom_nodes/ComfyUI-Impact-Pack/modules/impact/sample_error_enhancer.py", line 22, in informative_sample
raise e
File "/home/ubuntu/ComfyUI/custom_nodes/ComfyUI-Impact-Pack/modules/impact/sample_error_enhancer.py", line 9, in informative_sample
return original_sample(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/sample.py", line 100, in sample
samples = sampler.sample(noise, positive_copy, negative_copy, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 711, in sample
return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 617, in sample
samples = sampler.sample(model_wrap, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 556, in sample
samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/k_diffusion/sampling.py", line 137, in sample_euler
denoised = model(x, sigma_hat * s_in, **extra_args)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 277, in forward
out = self.inner_model(x, sigma, cond=cond, uncond=uncond, cond_scale=cond_scale, model_options=model_options, seed=seed)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 267, in forward
return self.apply_model(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 264, in apply_model
out = sampling_function(self.inner_model, x, timestep, uncond, cond, cond_scale, model_options=model_options, seed=seed)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 252, in sampling_function
cond, uncond = calc_cond_uncond_batch(model, cond, uncond, x, timestep, model_options)
File "/home/ubuntu/ComfyUI/comfy/samplers.py", line 228, in calc_cond_uncond_batch
output = model_options['model_function_wrapper'](model.apply_model, {"input": input_x, "timestep": timestep, "c": c, "cond_or_uncond": cond_or_uncond}).chunk(batch_chunks)
File "/home/ubuntu/ComfyUI/custom_nodes/ComfyUI_stable_fast/node.py", line 69, in __call__
return self.stable_fast_model.get_traced_module(input_x, timestep, **c)[0](
File "/home/ubuntu/ComfyUI/custom_nodes/ComfyUI_stable_fast/module/stable_diffusion_pipeline_compiler.py", line 62, in get_traced_module
traced_m, call_helper = trace_with_kwargs(
File "/opt/conda/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 23, in trace_with_kwargs
traced_module = better_trace(TraceablePosArgOnlyModuleWrapper(func),
File "/opt/conda/lib/python3.10/site-packages/sfast/jit/utils.py", line 29, in better_trace
script_module = torch.jit.trace(func, *args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/jit/_trace.py", line 798, in trace
return trace_module(
File "/opt/conda/lib/python3.10/site-packages/torch/jit/_trace.py", line 1065, in trace_module
module._c._create_method_from_trace(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1508, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 127, in forward
outputs = self.module(*orig_args, **orig_kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1508, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/sfast/jit/trace_helper.py", line 77, in forward
return self.func(*args, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/model_base.py", line 68, in apply_model
model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1508, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/ldm/modules/diffusionmodules/openaimodel.py", line 619, in forward
h = forward_timestep_embed(module, h, emb, context, transformer_options)
File "/home/ubuntu/ComfyUI/comfy/ldm/modules/diffusionmodules/openaimodel.py", line 35, in forward_timestep_embed
x = layer(x, emb)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1508, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ubuntu/ComfyUI/comfy/ldm/modules/diffusionmodules/openaimodel.py", line 210, in forward
return checkpoint(
File "/home/ubuntu/ComfyUI/comfy/ldm/modules/diffusionmodules/util.py", line 121, in checkpoint
return CheckpointFunction.apply(func, len(inputs), *args)
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
RuntimeError: _Map_base::at
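Since the same workflow runs on the desktop but fails on the Ubuntu instance, a low-effort first step is to compare the exact installed versions on both machines rather than relying on the `>=` constraints. The sketch below uses only the standard library. `environment_report` is a hypothetical helper, and the package list is taken from the dependencies named in the report above. It does not diagnose the `_Map_base::at` error itself. It only makes version differences visible.

```python
from importlib import metadata
import platform
import sys

# Distribution names as they appear in the report above.
PACKAGES = ("torch", "diffusers", "xformers", "triton", "stable-fast")


def environment_report(packages=PACKAGES):
    """Collect interpreter, platform and package versions into a dict."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed in this environment
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value if value is not None else 'not installed'}")
```

Running this on both machines and comparing the two outputs shows at a glance whether, for example, the PyTorch or ComfyUI custom-node environments differ.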
Hi chengzeyi,
First off, I wanted to congratulate you on this awesome work! I have also been working on a similar project here, but I recently stopped development since your project has already been widely adopted. However, there are some features I have been working on that I believe could enhance stable-fast.
I would love to work with you on the above topics if possible, since I have already partially implemented quite a few of these! Please let me know if you could see us collaborating in the future.
Using Linux 22.04, Torch 2., diffusers 0.22.1, xformers 0.0.22.post7, triton 2.1
I've posted the entire log of the build hoping it helps isolate what I'm doing wrong.
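The build log that follows stops at `fatal error: cudnn.h: No such file or directory`, and the only CUDA toolkit include path on the `nvcc` command line is `/usr/local/cuda/include`. A quick way to see whether a `cudnn.h` header exists anywhere obvious is sketched below. The directory list is an assumption: it covers the toolkit include directory, `/usr/include`, and the `nvidia/cudnn/include` folder where the pip `nvidia-cudnn-cu12` wheel is assumed to unpack its headers. `candidate_include_dirs` and `find_header` are hypothetical helper names.

```python
import os
import site


def candidate_include_dirs():
    """Directories where cudnn.h might plausibly live (assumed list)."""
    dirs = []
    for var in ("CUDA_HOME", "CUDA_PATH"):
        root = os.environ.get(var)
        if root:
            dirs.append(os.path.join(root, "include"))
    dirs += ["/usr/local/cuda/include", "/usr/include"]

    # pip wheels are assumed to place headers under nvidia/cudnn/include.
    site_dirs = list(getattr(site, "getsitepackages", lambda: [])())
    site_dirs.append(site.getusersitepackages())
    for sp in site_dirs:
        dirs.append(os.path.join(sp, "nvidia", "cudnn", "include"))
    return dirs


def find_header(name="cudnn.h", dirs=None):
    """Return every path in dirs that contains the named header file."""
    dirs = candidate_include_dirs() if dirs is None else dirs
    return [
        os.path.join(d, name)
        for d in dirs
        if os.path.isfile(os.path.join(d, name))
    ]


if __name__ == "__main__":
    hits = find_header()
    print("\n".join(hits) if hits else "cudnn.h not found in the checked directories")
```

An empty result suggests the cuDNN development headers are not installed for the system toolkit. A hit outside `/usr/local/cuda/include` suggests the header exists but the compiler is not being pointed at it.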
Using pip 23.3.1 from /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/pip (python 3.10)
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting stable-fast
Cloning https://github.com/chengzeyi/stable-fast.git (to revision main) to /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273
Running command git version
git version 2.34.1
Running command git clone --filter=blob:none https://github.com/chengzeyi/stable-fast.git /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273
Cloning into '/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273'...
Running command git show-ref main
52afadfe0f49b2aa13e9ac15566cedb4e0732784 refs/heads/main
52afadfe0f49b2aa13e9ac15566cedb4e0732784 refs/remotes/origin/main
Running command git symbolic-ref -q HEAD
refs/heads/main
Resolved https://github.com/chengzeyi/stable-fast.git to commit 52afadfe0f49b2aa13e9ac15566cedb4e0732784
Running command git rev-parse HEAD
52afadfe0f49b2aa13e9ac15566cedb4e0732784
Running command python setup.py egg_info
running egg_info
creating /tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info
writing /tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/PKG-INFO
writing dependency_links to /tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/dependency_links.txt
writing requirements to /tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/requires.txt
writing top-level names to /tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/top_level.txt
writing manifest file '/tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/SOURCES.txt'
reading manifest file '/tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file '/tmp/pip-pip-egg-info-q56s4o0e/stable_fast.egg-info/SOURCES.txt'
Preparing metadata (setup.py) ... done
Requirement already satisfied: packaging in ./venv/lib/python3.10/site-packages (from stable-fast) (23.2)
Requirement already satisfied: torch>=1.12.0 in ./venv/lib/python3.10/site-packages (from stable-fast) (2.1.0)
Requirement already satisfied: filelock in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (3.13.1)
Requirement already satisfied: typing-extensions in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (4.8.0)
Requirement already satisfied: sympy in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (1.12)
Requirement already satisfied: networkx in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (3.2.1)
Requirement already satisfied: jinja2 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (3.1.2)
Requirement already satisfied: fsspec in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (2023.10.0)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.105)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.105)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.105)
Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (8.9.2.26)
Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.3.1)
Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (11.0.2.54)
Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (10.3.2.106)
Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (11.4.5.107)
Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.0.106)
Requirement already satisfied: nvidia-nccl-cu12==2.18.1 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (2.18.1)
Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (12.1.105)
Requirement already satisfied: triton==2.1.0 in ./venv/lib/python3.10/site-packages (from torch>=1.12.0->stable-fast) (2.1.0)
Requirement already satisfied: nvidia-nvjitlink-cu12 in ./venv/lib/python3.10/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch>=1.12.0->stable-fast) (12.3.52)
Requirement already satisfied: MarkupSafe>=2.0 in ./venv/lib/python3.10/site-packages (from jinja2->torch>=1.12.0->stable-fast) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in ./venv/lib/python3.10/site-packages (from sympy->torch>=1.12.0->stable-fast) (1.3.0)
Building wheels for collected packages: stable-fast
Running command python setup.py bdist_wheel
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-cpython-310
creating build/lib.linux-x86_64-cpython-310/sfast
copying sfast/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast
creating build/lib.linux-x86_64-cpython-310/sfast/jit
copying sfast/jit/trace_helper.py -> build/lib.linux-x86_64-cpython-310/sfast/jit
copying sfast/jit/utils.py -> build/lib.linux-x86_64-cpython-310/sfast/jit
copying sfast/jit/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/jit
creating build/lib.linux-x86_64-cpython-310/sfast/compilers
copying sfast/compilers/stable_diffusion_pipeline_compiler.py -> build/lib.linux-x86_64-cpython-310/sfast/compilers
copying sfast/compilers/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/compilers
creating build/lib.linux-x86_64-cpython-310/sfast/utils
copying sfast/utils/aot_printer.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
copying sfast/utils/patch.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
copying sfast/utils/copy_func.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
copying sfast/utils/custom_python_operator.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
copying sfast/utils/torch_dispatch.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
copying sfast/utils/flat_tensors.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
copying sfast/utils/gpu_device.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
copying sfast/utils/memory_format.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
copying sfast/utils/xformers_attention.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
copying sfast/utils/env.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
copying sfast/utils/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
copying sfast/utils/compute_precision.py -> build/lib.linux-x86_64-cpython-310/sfast/utils
creating build/lib.linux-x86_64-cpython-310/sfast/triton
copying sfast/triton/torch_ops.py -> build/lib.linux-x86_64-cpython-310/sfast/triton
copying sfast/triton/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/triton
creating build/lib.linux-x86_64-cpython-310/sfast/dynamo
copying sfast/dynamo/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/dynamo
creating build/lib.linux-x86_64-cpython-310/sfast/cuda
copying sfast/cuda/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/cuda
copying sfast/cuda/graphs.py -> build/lib.linux-x86_64-cpython-310/sfast/cuda
creating build/lib.linux-x86_64-cpython-310/sfast/jit/passes
copying sfast/jit/passes/triton_passes.py -> build/lib.linux-x86_64-cpython-310/sfast/jit/passes
copying sfast/jit/passes/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/jit/passes
creating build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
copying sfast/utils/term_image/image_to_ansi.py -> build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
copying sfast/utils/term_image/imgcat.py -> build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
copying sfast/utils/term_image/kdtree.py -> build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
copying sfast/utils/term_image/climage.py -> build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
copying sfast/utils/term_image/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/utils/term_image
creating build/lib.linux-x86_64-cpython-310/sfast/triton/modules
copying sfast/triton/modules/patch.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/modules
copying sfast/triton/modules/native.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/modules
copying sfast/triton/modules/diffusers.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/modules
copying sfast/triton/modules/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/modules
creating build/lib.linux-x86_64-cpython-310/sfast/triton/ops
copying sfast/triton/ops/group_norm.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/ops
copying sfast/triton/ops/copy.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/ops
copying sfast/triton/ops/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/ops
copying sfast/triton/ops/activation.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/ops
copying sfast/triton/ops/conv.py -> build/lib.linux-x86_64-cpython-310/sfast/triton/ops
creating build/lib.linux-x86_64-cpython-310/sfast/dynamo/backends
copying sfast/dynamo/backends/sfast_jit.py -> build/lib.linux-x86_64-cpython-310/sfast/dynamo/backends
copying sfast/dynamo/backends/registry.py -> build/lib.linux-x86_64-cpython-310/sfast/dynamo/backends
copying sfast/dynamo/backends/__init__.py -> build/lib.linux-x86_64-cpython-310/sfast/dynamo/backends
running build_ext
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py:414: UserWarning: The detected CUDA version (12.3) has a minor version mismatch with the version that was used to compile PyTorch (12.1). Most likely this shouldn't be a problem.
warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py:424: UserWarning: There are no x86_64-linux-gnu-g++ version bounds defined for CUDA version 12.3
warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
building 'sfast._C' extension
creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310
creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp
creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6
creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273
creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast
creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc
creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit
creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators
creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas
creating /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn
Emitting ninja build file /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/13] /usr/local/cuda/bin/nvcc -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --extended-lambda -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -std=c++17
FAILED: /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o
/usr/local/cuda/bin/nvcc -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --extended-lambda -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -std=c++17
In file included from /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution_impl.cu:7:
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/ATen/cudnn/cudnn-wrapper.h:3:10: fatal error: cudnn.h: No such file or directory
3 | #include <cudnn.h>
| ^~~~~~~~~
compilation terminated.
[2/13] /usr/local/cuda/bin/nvcc -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/CUDABlas.cu -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/CUDABlas.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --extended-lambda -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -std=c++17
/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/CUDABlas.cu(291): warning #177-D: variable "falpha" was declared but never referenced
float falpha = alpha;
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/CUDABlas.cu(292): warning #177-D: variable "fbeta" was declared but never referenced
float fbeta = beta;
^
[3/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/misc.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/misc.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/misc.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
[4/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/scalar_tensor_erase.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/scalar_tensor_erase.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/scalar_tensor_erase.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
[5/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/fused_linear.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/fused_linear.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/fused_linear.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
[6/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/op_input_tensor_conversion.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/op_input_tensor_conversion.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/op_input_tensor_conversion.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
[7/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/device_constant_override.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/device_constant_override.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/device_constant_override.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
[8/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/compilation_unit.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/compilation_unit.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/compilation_unit.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
[9/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/main.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/main.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/main.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
[10/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/python_operator.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/python_operator.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/python_operator.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
In file included from /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/utils/python_arg_parser.h:65,
from /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/jit/python/pybind_utils.h:26,
from /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/python_operator.cpp:10:
/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/utils/python_strings.h:104:19: warning: ‘pybind11::object PyObject_FastGetAttrString(PyObject*, const char*)’ defined but not used [-Wunused-function]
104 | static py::object PyObject_FastGetAttrString(PyObject* obj, const char* name) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
[11/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cudnn/cudnn_convolution.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
[12/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/init.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/init.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/jit/init.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
[13/13] c++ -MMD -MF /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/cublas_gemm.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DWITH_CUDA -I/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/TH -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/include -I/usr/include/python3.10 -c -c /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/cublas_gemm.cpp -o /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/build/temp.linux-x86_64-cpython-310/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/cublas_gemm.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/cublas_gemm.cpp:641:55: warning: "/*" within comment [-Wcomment]
641 | beta_, self.scalar_type(), c10::nullopt /* layout */
|
/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/sfast/csrc/operators/cublas/cublas_gemm.cpp:642:28: warning: "/*" within comment [-Wcomment]
642 | /*, at::kCPU, c10::nullopt /* pin_memory */ /*));
|
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/setup.py", line 109, in <module>
setup(
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/__init__.py", line 103, in setup
return distutils.core.setup(**attrs)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
return run_commands(dist)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
dist.run_commands()
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
self.run_command(cmd)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/dist.py", line 989, in run_command
super().run_command(command)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/wheel/bdist_wheel.py", line 369, in run
self.run_command("build")
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/dist.py", line 989, in run_command
super().run_command(command)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/command/build.py", line 131, in run
self.run_command(cmd_name)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/dist.py", line 989, in run_command
super().run_command(command)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 88, in run
_build_ext.run(self)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
self.build_extensions()
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 873, in build_extensions
build_ext.build_extensions(self)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
self._build_extensions_serial()
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
self.build_extension(ext)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 249, in build_extension
_build_ext.build_extension(self, ext)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/Cython/Distutils/build_ext.py", line 135, in build_extension
super(build_ext, self).build_extension(ext)
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
objects = self.compiler.compile(
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 686, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1774, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
full command: /media/drakosfire/Shared/StatBlockGenerator/linux-build/venv/bin/python3 -u -c '
exec(compile('"'"''"'"''"'"'
# This is <pip-setuptools-caller> -- a caller that pip uses to run setup.py
#
# - It imports setuptools before invoking setup.py, to enable projects that directly
# import from `distutils.core` to work with newer packaging standards.
# - It provides a clear error message when setuptools is not installed.
# - It sets `sys.argv[0]` to the underlying `setup.py`, when invoking `setup.py` so
# setuptools doesn'"'"'t think the script is `-c`. This avoids the following warning:
# manifest_maker: standard file '"'"'-c'"'"' not found".
# - It generates a shim setup.py, for handling setup.cfg-only projects.
import os, sys, tokenize
try:
import setuptools
except ImportError as error:
print(
"ERROR: Can not execute `setup.py` since setuptools is not available in "
"the build environment.",
file=sys.stderr,
)
sys.exit(1)
__file__ = %r
sys.argv[0] = __file__
if os.path.exists(__file__):
filename = __file__
with tokenize.open(__file__) as f:
setup_py_code = f.read()
else:
filename = "<auto-generated setuptools caller>"
setup_py_code = "from setuptools import setup; setup()"
exec(compile(setup_py_code, filename, "exec"))
'"'"''"'"''"'"' % ('"'"'/tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/setup.py'"'"',), "<pip-setuptools-caller>", "exec"))' bdist_wheel -d /tmp/pip-wheel-_zs0b_wo
cwd: /tmp/pip-install-hjxxp7h6/stable-fast_9daa074108504c198a8d01018610b273/
Building wheel for stable-fast (setup.py) ... error
ERROR: Failed building wheel for stable-fast
Running setup.py clean for stable-fast
Running command python setup.py clean
running clean
removing 'build/temp.linux-x86_64-cpython-310' (and everything under it)
removing 'build/lib.linux-x86_64-cpython-310' (and everything under it)
'build/bdist.linux-x86_64' does not exist -- can't clean it
'build/scripts-3.10' does not exist -- can't clean it
removing 'build'
Failed to build stable-fast
ERROR: Could not build wheels for stable-fast, which is required to install pyproject.toml-based projects
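The excerpt above ends with ninja's "build stopped" summary and pip's wrapper errors, but the compiler step that actually failed is not among the lines shown. A minimal sketch for locating it in a saved copy of the full log follows. It assumes the usual conventions that ninja prefixes a failing step with "FAILED: " and that compilers write "error:" in diagnostics; the helper name `find_failures` and the sample log text are hypothetical.

```python
import re

# Markers that usually point at the real failure in a ninja/gcc build log.
# This short list is an assumption for illustration, not an exhaustive set.
FAILURE_PATTERNS = [
    re.compile(r"^FAILED: "),  # ninja prints this before the command that failed
    re.compile(r"\berror:"),   # compiler diagnostics such as "fatal error: ..."
]


def find_failures(log_text, context=2):
    """Return (line_number, surrounding_lines) for each line matching a pattern."""
    lines = log_text.splitlines()
    hits = []
    for index, line in enumerate(lines):
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS):
            start = max(0, index - context)
            end = min(len(lines), index + context + 1)
            hits.append((index + 1, lines[start:end]))
    return hits


# Made-up log fragment, used only to show the call.
sample_log = "\n".join([
    "[1/2] c++ -c a.cpp -o a.o",
    "FAILED: b.o",
    "c++ -c b.cpp -o b.o",
    "b.cpp:3:10: fatal error: some_header.h: No such file or directory",
    "ninja: build stopped: subcommand failed.",
])

for line_number, snippet in find_failures(sample_log, context=0):
    print(line_number, snippet[0])
```

On a real pip log, the wrapper line "error: subprocess-exited-with-error" also matches the second pattern, so the earliest hits are usually the ones worth reading.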
Hello, I see in the README that stable-fast supports LoRAs out of the box. Does anyone know if it's possible to use LyCORIS LoRAs with this package? The NVIDIA TensorRT extension for A1111 seems to have questionable support for them.
Hi @chengzeyi, I sent you an email last week and wanted to try again here. We'd love to get in touch with you: Vlad has now applied your code to SDNext, and we're seeing great results on our dev branch.
Hit us up on our Discord server or just reply to my email so we can connect, as we plan on posting our results.
Thanks!
Thank you for sharing! Does this also work when loading LoRA weights and running inference with them? Thank you!
The following problem occurs when I call compile(pipe):
Traceback (most recent call last):
File "svd_sf.py", line 49, in
frames = pipe(image, decode_chunk_size=7, num_frames=20).frames[0]
File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/diffusers/pipelines/stable_video_diffusion/pipeline_stable_video_diffusion.py", line 499, in __call__
noise_pred = self.unet(
File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/cuda/graphs.py", line 40, in dynamic_graphed_callable
cached_callable = simple_make_graphed_callable(
File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/cuda/graphs.py", line 61, in simple_make_graphed_callable
return make_graphed_callable(func,
File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/cuda/graphs.py", line 90, in make_graphed_callable
func(*tree_copy(example_inputs, detach=True),
File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/jit/trace_helper.py", line 64, in wrapper
return traced_module(*args, **kwargs)
File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/jit/trace_helper.py", line 133, in forward
outputs = self.module(*self.convert_inputs(args, kwargs))
File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
graph(%1, %2, %3, %4, %5, %6, %7, %8, %9, %10, %11, %12, %13, %14, %15):
%x = sfast::cudnn_convolution_bias_add(%1, %2, %3, %14, %15, %4, %5, %6, %7, %8, %9)
~~~~~ <--- HERE
return (%x)
RuntimeError: no valid convolution algorithms available in CuDNN
I have cuDNN installed on my server; torch.backends.cudnn.is_available() and torch.backends.cudnn.enabled both show True.
Update: I successfully ran your example in README.md. Currently I'm trying to accelerate Stable Video Diffusion, which involves very large matmuls. Could this be the reason?
I am using stable-fast compilation to run SD 2.1 models at multiple aspect ratios. The first inference at each new size takes about 50 seconds, after which inference at that size is fast.
Is there a way to speed up the process?
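If the one-time cost really is paid once per distinct size, as described above, one generic way to bound it is to serve only a small fixed set of sizes and warm each one up once at startup. The sketch below shows only the size-selection step. It is not a stable-fast feature; the bucket list and the helper name `snap_to_bucket` are hypothetical and would need to match the sizes your application actually needs.

```python
import math

# Hypothetical (width, height) sizes that would each be warmed up once at startup.
BUCKETS = [(512, 512), (768, 512), (512, 768), (768, 768)]


def snap_to_bucket(width, height, buckets=BUCKETS):
    """Pick the bucket with the closest aspect ratio; break ties by closest pixel area."""
    target_ratio = math.log(width / height)
    target_area = width * height

    def distance(bucket):
        bucket_width, bucket_height = bucket
        ratio_gap = abs(math.log(bucket_width / bucket_height) - target_ratio)
        area_gap = abs(bucket_width * bucket_height - target_area)
        return (ratio_gap, area_gap)

    return min(buckets, key=distance)


# A landscape request is mapped to the landscape bucket.
print(snap_to_bucket(800, 500))
```

The image can then be generated at the bucket size and resized or cropped to the requested size afterwards, which trades some output fidelity for a bounded number of warm-ups.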
Environment Information:
Problem Description:
When trying to install the stable-fast package, I encounter ModuleNotFoundError: No module named 'torch'. I have confirmed that torch is installed in the virtual environment using pip list. Also, no package conflicts were detected with pip check.
Attempted Solutions:
torch in the virtual environment.
Additional Information:
The same error occurs when installing directly from the GitHub repository and from PyPI.
I would appreciate any suggestions or solutions to this problem.
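One quick check is whether the interpreter that pip uses can itself import torch. The sketch below uses only the standard library; the helper name `module_available` is hypothetical. A possible cause that this thread does not confirm is pip's build isolation, where setup.py runs in a temporary environment that does not contain the packages from the virtual environment. If that is the case here, pip's `--no-build-isolation` flag makes the build use the current environment instead.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if `name` can be found by the interpreter running this script."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Run this with the same python that `pip` belongs to, e.g. `python check.py`.
    print("interpreter:", sys.executable)
    print("torch importable:", module_available("torch"))
```

If this prints True while the install still fails with the same error, the build is probably running in a different environment from the one inspected with pip list.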
First of all, congrats on the release of such an amazing project! Reading through the list of individual optimizations / 'features', I am super curious about how much each of them contributes to the total performance increase, and whether there is a reasonably smaller set of operations that can yield 85-90% of the total gain while reducing the surface area by a lot. While developing stable-fast, have any numbers been collected between new optimization passes, even unconfirmed ones?
Releases after 12.0 (12.0 post1 through 12.0 post3, and the nightly) generate black images with the same dependencies as 12.0.
Using the NV CUDA 12.1 container as base:
requirements.txt
diffusers==0.23.1
xformers
torch==2.1.0
nvidia-pytriton==0.4.1
numpy==1.26.2
triton==2.1.0
requests
transformers==4.35.2
tokenizers==0.15.0
https://github.com/chengzeyi/stable-fast/releases/download/v0.0.12/stable_fast-0.0.12_torch210_cu121-cp310-cp310-manylinux2014_x86_64.whl
Hi, I'm running SD 2.1 and got the following error. How can I fix it?
File "/zhangjun/git/stable-fast/src/sfast/jit/trace_helper.py", line 78, in forward
return self.func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/diffusers/schedulers/scheduling_euler_ancestral_discrete.py", line 356, in step
noise = randn_tensor(model_output.shape, dtype=model_output.dtype, device=device, generator=generator)
File "/opt/conda/lib/python3.10/site-packages/diffusers/utils/torch_utils.py", line 80, in randn_tensor
latents = torch.randn(shape, generator=generator, device=rand_device, dtype=dtype, layout=layout).to(device)
RuntimeError: Found an unsupported argument type in the JIT tracer: at::Generator. File a bug report.
Config options are as follows:
config = CompilationConfig.Default()
try:
    import xformers
    config.enable_xformers = True
except ImportError:
    print('xformers not installed, skip')
try:
    import triton
    config.enable_triton = True
    torch.backends.cuda.matmul.allow_tf32 = True
except ImportError:
    print('Triton not installed, skip')
config.enable_cuda_graph = True
config.trace_scheduler = True
Inference code:
scheduler = EulerAncestralDiscreteScheduler.from_pretrained(
    model_path, subfolder="scheduler"
)
pipeline = StableDiffusionPipeline.from_pretrained(
    model_path,
    scheduler=scheduler,
    torch_dtype=self.dtype,
    safety_checker=None,
)
# compile
...
pipeline = compile(pipeline, config)
sfast_inputs = dict(
    prompt=prompt,
    negative_prompt=neg_prompt,
    generator=torch.Generator(device='cuda').manual_seed(seed),
    width=width,
    height=height,
    num_inference_steps=num_inference_steps,
)
image = pipeline(**sfast_inputs).images
Environment Information:
Do you think any of these optimizations could be applied to Stable Video Diffusion? I'd like to help here if possible.
First, awesome work. The speed improvement is great, and the additional cold start time is very manageable.
Second, I load many pipelines for each model, and do things like swap out ControlNets and VAEs in response to user requests. When I swap ControlNets and run a generation, I get the error RuntimeError: mat1 and mat2 must have the same dtype, but got Half and Float. I'm guessing this is because the replacement ControlNet hasn't had the same optimizations as the original. Is there a way to compile just the ControlNets so that they can be swapped in and out of the pipeline dynamically?