Comments (11)
I have observed the same performance regression when testing on my PC. Initially I thought it was caused by insufficient GPU VRAM, but since the A100 has relatively large VRAM, it must be caused by some other restriction or bug.
Hi, another issue I found is that it's not accelerating SDXL. I'm running the demo on an A100: SDXL with the compiled model runs at 5.3 it/s, but with plain diffusers it's 8.8 it/s. The stable-fast compiled model is slower.
I still think it's because of insufficient VRAM. Could you please share more info about your system and inference configuration? I want to know the peak VRAM utilization during inference and your image resolution.
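For reference, something like this minimal sketch (not from the repo) prints the peak VRAM that PyTorch itself allocated during one inference:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one inference here, e.g. output = model(**kwarg_inputs) ...
torch.cuda.synchronize()
peak_gib = torch.cuda.max_memory_allocated() / 2**30
free, total = torch.cuda.mem_get_info()
# Note: the peak only covers PyTorch's caching allocator; any driver-side
# spill into shared VRAM will not appear here, so cross-check nvidia-smi.
print(f'peak allocated by PyTorch: {peak_gib:.2f} GiB')
print(f'device free/total: {free / 2**30:.2f} / {total / 2**30:.2f} GiB')
```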
Do you happen to run SDXL on WSL, Windows, or another operating system that supports shared VRAM?
I think I have found the reason: on systems that support shared VRAM, the NVIDIA driver dispatches allocation requests to shared VRAM instead of throwing an OOM error when GPU VRAM is insufficient, whether because the model is too large, the resolution is too high, or a PyTorch leak prevents previously allocated memory from being released.
Shared VRAM is in fact just your computer's system memory. It is thousands of times slower than the dedicated VRAM on the board, so inference slows down even if only a few layers and intermediate buffers end up in shared VRAM.
Hi,
Thanks so much for your reply. I'm running on Ubuntu 20.04 inside a Docker container started with --gpus all. The GPU I used is an A100 40G, so there should be enough VRAM for running the SDXL model.
> Thanks so much for your reply. I'm running on Ubuntu 20.04 inside a Docker container started with --gpus all. The GPU I used is an A100 40G, so there should be enough VRAM for running the SDXL model.
That's really weird. On my own system I'm about 90% sure the problem is caused by the VRAM offloading mechanism of the NVIDIA driver on Windows. But I don't have a GPU with large VRAM like the A100 to test on, so it's hard to debug.
Do you have any debugging script so I can run some tests on my instance?
> Do you have any debugging script so I can run some tests on my instance?
The following script should work. A detailed performance analysis can be exported with Nsight Systems.
```python
import torch
from diffusers import (StableDiffusionXLPipeline,
                       EulerAncestralDiscreteScheduler)
from sfast.compilers.stable_diffusion_pipeline_compiler import (
    compile, CompilationConfig)


def load_model():
    # NOTE:
    # This loads the SDXL model; you could switch the pipeline class
    # to load a different model.
    # If the resolution is high (1024x1024), ensure your VRAM is
    # sufficient (or RAM? I'm not sure, maybe I should upgrade my PC),
    # or the performance might regress.
    model = StableDiffusionXLPipeline.from_pretrained(
        'stabilityai/stable-diffusion-xl-base-1.0',
        torch_dtype=torch.float16)
    model.scheduler = EulerAncestralDiscreteScheduler.from_config(
        model.scheduler.config)
    model.safety_checker = None
    model.to(torch.device('cuda'))
    return model


model = load_model()

config = CompilationConfig.Default()
# xformers and Triton are suggested for achieving the best performance.
# It might be slow for Triton to generate, compile and fine-tune kernels.
try:
    import xformers
    config.enable_xformers = True
except ImportError:
    print('xformers not installed, skip')
# NOTE:
# When GPU VRAM is insufficient or the architecture is too old,
# Triton might be slow. Disable Triton if you encounter this problem.
try:
    import triton
    config.enable_triton = True
except ImportError:
    print('Triton not installed, skip')
# NOTE:
# CUDA Graph is suggested for small batch sizes and small resolutions
# to reduce CPU overhead. My implementation can handle dynamic shapes
# at the cost of increased GPU memory usage. But when your GPU VRAM is
# insufficient or the image resolution is high, CUDA Graph could cause
# less efficient VRAM utilization and slow down the inference.
# If you run into problems with it, disable it.
config.enable_cuda_graph = True

compiled_model = compile(model, config)

kwarg_inputs = dict(
    prompt='(masterpiece:1.2), best quality, masterpiece, '
    'best detail face, lineart, monochrome, a beautiful girl',
    # NOTE: If you use SDXL, you should use a higher resolution
    # to improve the generation quality.
    height=1024,
    width=1024,
    num_inference_steps=30,
    num_images_per_prompt=1,
)

# NOTE: Warm it up.
# The first call will trigger compilation and might be very slow.
# After the first call, it should be very fast.
output_image = compiled_model(**kwarg_inputs).images[0]

# Let's see the second call!
output_image = compiled_model(**kwarg_inputs).images[0]
```
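To capture the Nsight Systems trace mentioned above, the script can be launched under nsys; the output name and script filename below are just placeholders:
nsys profile -t cuda,nvtx -o sdxl_report python debug_sdxl.py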
> Do you have any debugging script so I can run some tests on my instance?
Information about the current PyTorch environment can be collected as below:
python -m torch.utils.collect_env
In my case, it was close to 10 to 12 it/s (30 steps).
Maybe the A100 is already fast enough that compilation doesn't show much improvement (?)
Stock SDXL - 3.87 s
Manual torch.compile - 3.43 s
stable-fast compiled SDXL - 3.3 s
I am not quoting iterations-per-second numbers because they vary too much: they start off really high and end up low. But I notice that the initial iterations run faster than stock.
> In my case, it was close to 10 to 12 it/s (30 steps). Stock SDXL - 3.87 s, manual torch.compile - 3.43 s, stable-fast compiled SDXL - 3.3 s. I am not quoting iterations-per-second numbers because they vary too much: they start off really high and end up low.
The printed table could be inaccurate due to some mysterious bug in Python's cProfile (maybe?), and profiling itself can add relatively high CPU overhead. I don't know how to solve that. Maybe a plain time.time() would be a good replacement?
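For example, a plain wall-clock harness along these lines (my sketch, not code from the repo) avoids cProfile entirely; torch.cuda.synchronize() makes sure queued GPU work is included in the measurement:

```python
import time
import torch

def benchmark(pipe, kwarg_inputs, n_runs=3):
    # Warm up once so compilation time is excluded from the measurement.
    pipe(**kwarg_inputs)
    torch.cuda.synchronize()
    steps = kwarg_inputs['num_inference_steps']
    for _ in range(n_runs):
        start = time.time()
        pipe(**kwarg_inputs)
        # Wait for all queued CUDA kernels before reading the clock.
        torch.cuda.synchronize()
        elapsed = time.time() - start
        # Approximate it/s; the total also includes text encoder and
        # VAE time, so it will read slightly lower than the per-step rate.
        print(f'{elapsed:.2f} s total, {steps / elapsed:.2f} it/s')
```

Calling benchmark(compiled_model, kwarg_inputs) with the script above would print a stable seconds-per-image figure instead of the fluctuating per-step readout.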
An A100 80GB can reach very impressive speeds. About six months ago I could achieve a generation speed of 61.8 it/s on an A100. However, to achieve this I needed a modified scheduler to reduce CPU overhead, and that conflicts with my wish that users can use any scheduler they want. So I had to sacrifice that, and we also disable some further optimizations now so that users can switch LoRA dynamically.
Triton autotuning is another important technique for making kernels run faster, but because so many people want compilation itself to be fast, it has been replaced by heuristics. 😂
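To illustrate the trade-off with a generic kernel (a sketch, not stable-fast's actual kernels): triton.autotune benchmarks every candidate config at the first launch for each new key value, which is exactly what makes compilation slow when many configs are listed:

```python
import torch
import triton
import triton.language as tl

# Every config below is benchmarked on the first launch for each new
# value of `key`: more configs mean slower warm-up, faster steady state.
@triton.autotune(
    configs=[
        triton.Config({'BLOCK_SIZE': 256}, num_warps=4),
        triton.Config({'BLOCK_SIZE': 1024}, num_warps=8),
    ],
    key=['n_elements'],
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * 2.0, mask=mask)

x = torch.randn(1 << 20, device='cuda')
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta['BLOCK_SIZE']),)
scale_kernel[grid](x, out, x.numel())
```

Replacing the autotune decorator with a single heuristically chosen config removes the warm-up benchmarking at the cost of possibly suboptimal kernels, which matches the trade-off described above.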
Related Issues (20)
- Inference error with dynamic resolution
- Accelerate Stable Diffusion with LoRA and ControlNet?
- FP8 support in stable-fast
- Installation failed with an error
- How can I load the whole model from the compiled one instead of loading only the UNet into the current SD model?
- Too much recompilation after modifying the cross attention processor
- Can stable-fast be used with diffusers device_map='auto'?
- v1.0.4 release contains the prebuilt binary for 1.0.3
- Triton group normalization is slower than torch ops in some cases
- xformers 0.0.25 removes Triton fmha and completely breaks stable-fast, with fix (MemoryEfficientAttentionTritonFwdFlashBwOp, TritonFlashAttentionOp)
- Switching ControlNets
- Changing the working device causes problems
- [Bug] MemoryEfficientAttentionTritonFwdFlashBwOp no longer available
- Enabling enable_cuda_graph only supports two different image sizes during inference
- Support IP-Adapters
- enable_cuda_graph affects LoRA switching
- Could anyone share a working demo of switching LoRA with config.enable_cuda_graph = True?
- Install fatal error C1083 on Windows 11
- Stable-fast compatibility with Lightning models
- gcc "-std=99" may need to be declared in code