Comments (11)
I have observed the same performance regression when testing on my PC. Initially I thought it was caused by insufficient GPU VRAM, but since the A100 has relatively large VRAM, it must be caused by some other restriction or bug.
Hi, another issue I found is that it's not accelerating SDXL. I'm running the demo on an A100: SDXL with the compiled model runs at 5.3 it/s, but with plain diffusers it's 8.8 it/s. The stable-fast compiled model is slower.
I still think it's because of insufficient VRAM. Could you please share more info about your system and inference configuration? I want to know the peak VRAM utilization during inference and your image resolution.
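For reference, something like this minimal sketch (not from the repo) prints the peak VRAM that PyTorch itself allocated during one inference:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one inference here, e.g. output = model(**kwarg_inputs) ...
torch.cuda.synchronize()
peak_gib = torch.cuda.max_memory_allocated() / 2**30
free, total = torch.cuda.mem_get_info()
# Note: the peak only covers PyTorch's caching allocator; any driver-side
# spill into shared VRAM will not appear here, so cross-check nvidia-smi.
print(f'peak allocated by PyTorch: {peak_gib:.2f} GiB')
print(f'device free/total: {free / 2**30:.2f} / {total / 2**30:.2f} GiB')
```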
Do you happen to run SDXL on WSL, Windows, or another operating system that supports shared VRAM?
I think I have found the reason: on systems that support shared VRAM, the NVIDIA driver dispatches allocation requests to shared VRAM instead of throwing an OOM error when GPU VRAM is insufficient, whether because the model is too large, the resolution is too high, or a PyTorch leak prevents previously allocated memory from being released.
Shared VRAM is in fact just your computer's system memory. It is thousands of times slower than the dedicated VRAM on the board, so inference slows down even if only a few layers and intermediate buffers end up in shared VRAM.
Hi,
Thanks so much for your reply. I'm running on Ubuntu 20.04 inside a Docker container started with --gpus all. The GPU I used is an A100 40G, so there should be enough VRAM for running the SDXL model.
> Thanks so much for your reply. I'm running on Ubuntu 20.04 inside a Docker container started with --gpus all. The GPU I used is an A100 40G, so there should be enough VRAM for running the SDXL model.
That's really weird. On my own system I'm about 90% sure the problem is caused by the VRAM offloading mechanism of the NVIDIA driver on Windows. But I don't have a GPU with large VRAM like the A100 to test on, so it's hard to debug.
Do you have any debugging script so I can run some tests on my instance?
> Do you have any debugging script so I can run some tests on my instance?
The following script should work. A detailed performance analysis can be exported with Nsight Systems.
```python
import torch
from diffusers import (StableDiffusionXLPipeline,
                       EulerAncestralDiscreteScheduler)
from sfast.compilers.stable_diffusion_pipeline_compiler import (
    compile, CompilationConfig)


def load_model():
    # NOTE:
    # This loads the SDXL model; you could switch the pipeline class
    # to load a different model.
    # If the resolution is high (1024x1024), ensure your VRAM is
    # sufficient (or RAM? I'm not sure, maybe I should upgrade my PC),
    # or the performance might regress.
    model = StableDiffusionXLPipeline.from_pretrained(
        'stabilityai/stable-diffusion-xl-base-1.0',
        torch_dtype=torch.float16)
    model.scheduler = EulerAncestralDiscreteScheduler.from_config(
        model.scheduler.config)
    model.safety_checker = None
    model.to(torch.device('cuda'))
    return model


model = load_model()

config = CompilationConfig.Default()
# xformers and Triton are suggested for achieving the best performance.
# It might be slow for Triton to generate, compile and fine-tune kernels.
try:
    import xformers
    config.enable_xformers = True
except ImportError:
    print('xformers not installed, skip')
# NOTE:
# When GPU VRAM is insufficient or the architecture is too old,
# Triton might be slow. Disable Triton if you encounter this problem.
try:
    import triton
    config.enable_triton = True
except ImportError:
    print('Triton not installed, skip')
# NOTE:
# CUDA Graph is suggested for small batch sizes and small resolutions
# to reduce CPU overhead. My implementation can handle dynamic shapes
# at the cost of increased GPU memory usage. But when your GPU VRAM is
# insufficient or the image resolution is high, CUDA Graph could cause
# less efficient VRAM utilization and slow down the inference.
# If you run into problems with it, disable it.
config.enable_cuda_graph = True

compiled_model = compile(model, config)

kwarg_inputs = dict(
    prompt='(masterpiece:1.2), best quality, masterpiece, '
    'best detail face, lineart, monochrome, a beautiful girl',
    # NOTE: If you use SDXL, you should use a higher resolution
    # to improve the generation quality.
    height=1024,
    width=1024,
    num_inference_steps=30,
    num_images_per_prompt=1,
)

# NOTE: Warm it up.
# The first call will trigger compilation and might be very slow.
# After the first call, it should be very fast.
output_image = compiled_model(**kwarg_inputs).images[0]

# Let's see the second call!
output_image = compiled_model(**kwarg_inputs).images[0]
```
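To capture the Nsight Systems trace mentioned above, the script can be launched under nsys; the output name and script filename below are just placeholders:
nsys profile -t cuda,nvtx -o sdxl_report python debug_sdxl.py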
> Do you have any debugging script so I can run some tests on my instance?
Information about the current PyTorch environment can be collected as below:
python -m torch.utils.collect_env
In my case, it was close to 10 to 12 it/s (30 steps).
Maybe the A100 is already fast enough that compilation doesn't show much improvement (?)
Stock SDXL - 3.87 s
Manual torch.compile - 3.43 s
stable-fast compiled SDXL - 3.3 s
I am not quoting iterations-per-second numbers because they vary too much: they start off really high and end up low. But I notice that the initial iterations run faster than stock.
> In my case, it was close to 10 to 12 it/s (30 steps). Stock SDXL - 3.87 s, manual torch.compile - 3.43 s, stable-fast compiled SDXL - 3.3 s. I am not quoting iterations-per-second numbers because they vary too much: they start off really high and end up low.
The printed table could be inaccurate due to some mysterious bug in Python's cProfile (maybe?), and profiling itself can add relatively high CPU overhead. I don't know how to solve that. Maybe a plain time.time() would be a good replacement?
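For example, a plain wall-clock harness along these lines (my sketch, not code from the repo) avoids cProfile entirely; torch.cuda.synchronize() makes sure queued GPU work is included in the measurement:

```python
import time
import torch

def benchmark(pipe, kwarg_inputs, n_runs=3):
    # Warm up once so compilation time is excluded from the measurement.
    pipe(**kwarg_inputs)
    torch.cuda.synchronize()
    steps = kwarg_inputs['num_inference_steps']
    for _ in range(n_runs):
        start = time.time()
        pipe(**kwarg_inputs)
        # Wait for all queued CUDA kernels before reading the clock.
        torch.cuda.synchronize()
        elapsed = time.time() - start
        # Approximate it/s; the total also includes text encoder and
        # VAE time, so it will read slightly lower than the per-step rate.
        print(f'{elapsed:.2f} s total, {steps / elapsed:.2f} it/s')
```

Calling benchmark(compiled_model, kwarg_inputs) with the script above would print a stable seconds-per-image figure instead of the fluctuating per-step readout.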
An A100 80GB can reach very impressive speeds. About six months ago I could achieve a generation speed of 61.8 it/s on an A100. However, to achieve this I needed a modified scheduler to reduce CPU overhead, and that conflicts with my wish that users can use any scheduler they want. So I had to sacrifice that, and we also disable some further optimizations now so that users can switch LoRA dynamically.
Triton autotuning is another important technique for making kernels run faster, but because so many people want compilation itself to be fast, it has been replaced by heuristics. 😂
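To illustrate the trade-off with a generic kernel (a sketch, not stable-fast's actual kernels): triton.autotune benchmarks every candidate config at the first launch for each new key value, which is exactly what makes compilation slow when many configs are listed:

```python
import torch
import triton
import triton.language as tl

# Every config below is benchmarked on the first launch for each new
# value of `key`: more configs mean slower warm-up, faster steady state.
@triton.autotune(
    configs=[
        triton.Config({'BLOCK_SIZE': 256}, num_warps=4),
        triton.Config({'BLOCK_SIZE': 1024}, num_warps=8),
    ],
    key=['n_elements'],
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * 2.0, mask=mask)

x = torch.randn(1 << 20, device='cuda')
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta['BLOCK_SIZE']),)
scale_kernel[grid](x, out, x.numel())
```

Replacing the autotune decorator with a single heuristically chosen config removes the warm-up benchmarking at the cost of possibly suboptimal kernels, which matches the trade-off described above.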
Related Issues (20)
- Inference error with dynamic resolution
- Accelerate Stable Diffusion with LoRA and ControlNet?
- FP8 support in stable-fast
- Installation failed with an error
- How can I load the whole model from the compiled one instead of loading only the UNet into the current SD model?
- Too much recompilation after modifying the cross attention processor
- Can stable-fast be used with diffusers device_map='auto'?
- v1.0.4 release contains the prebuilt binary for 1.0.3
- Triton group normalization is slower than torch ops in some cases
- xformers 0.0.25 removes Triton fmha and completely breaks stable-fast, with fix (MemoryEfficientAttentionTritonFwdFlashBwOp, TritonFlashAttentionOp)
- Switching ControlNets
- Changing the working device causes problems
- [Bug] MemoryEfficientAttentionTritonFwdFlashBwOp no longer available
- Enabling enable_cuda_graph only supports two different image sizes during inference
- Support IP-Adapters
- enable_cuda_graph affects LoRA switching
- Could anyone share a working demo of switching LoRA with config.enable_cuda_graph = True?
- Install fatal error C1083 on Windows 11
- Stable-fast compatibility with Lightning models
- gcc "-std=99" may need to be declared in code