Comments (8)
So it's about on par with TensorRT but a bit slower than OneFlow and this repo?
Since I have not tested either TensorRT or OneFlow on the same hardware, it's hard to compare the it/s numbers reported here against mine directly. But as a relative performance point, torch.compile on an A100 40G is about 8% slower than stable-fast, which puts it in the same ballpark as OneFlow, which is about 9% slower than stable-fast.
I ran careful benchmarks on the RTX 4090 and RTX 3090 today.
Performance varies greatly across different hardware/software/platform/driver configurations, so benchmarking accurately is hard, and preparing the environment for benchmarking is also a significant job. I have tested on some platforms before, but those results may still be inaccurate.
The A100 is currently hard and expensive to rent from cloud providers in my region; benchmark results will be posted when I have access to an A100 again.
RTX 4090 (512x512, batch size 1, fp16, tcmalloc enabled)

| Framework | SD 1.5 | SD 2.1 | SD 1.5 ControlNet |
|---|---|---|---|
| Vanilla PyTorch (2.1.0+cu118) | 24.9 it/s | 27.1 it/s | 18.9 it/s |
| torch.compile (2.1.0+cu118, NHWC UNet) | 33.5 it/s | 38.2 it/s | 22.7 it/s |
| AITemplate | 65.7 it/s | 71.6 it/s | untested |
| OneFlow | 60.1 it/s | 12.9 it/s (??) | untested |
| TensorRT | untested | untested | untested |
| Stable Fast (with xformers & triton) | 61.8 it/s | 61.6 it/s | 42.3 it/s |
RTX 3090 (512x512, batch size 1, fp16, tcmalloc enabled)

| Framework | SD 1.5 |
|---|---|
| Vanilla PyTorch (2.1.0+cu118) | 22.5 it/s |
| torch.compile (2.1.0+cu118, NHWC UNet) | 25.3 it/s |
| AITemplate | 34.6 it/s |
| OneFlow | 38.8 it/s |
| TensorRT | untested |
| Stable Fast (with xformers & triton) | 31.5 it/s |
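For reference, here is a minimal sketch of how an it/s figure like those in the tables can be measured with diffusers. The model ID, prompt, step count, and warmup policy are my assumptions, not the exact harness used for the numbers above.

```python
import time

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

steps = 30
prompt = "a photo of an astronaut riding a horse"

# Warmup run so one-time CUDA/allocation costs don't skew the timing.
pipe(prompt, num_inference_steps=steps, height=512, width=512)

torch.cuda.synchronize()
start = time.perf_counter()
pipe(prompt, num_inference_steps=steps, height=512, width=512)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"{steps / elapsed:.1f} it/s")
```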
I have made an attempt at reproducing the results, but always take these with a grain of salt, since there may be differences in CUDA/library versions and hardware (especially the memory bandwidth difference between the 40G and 80G A100 variants), etc.
On one of our A100 40G cards, using model=SD1.5, batch_size=1, steps=100, I get:
-> SD1.5 out of the box with torch 2.1 SDPA: ~32 it/s
-> SD1.5 + torch.compile: ~51 it/s
-> SD1.5 + stable-fast: ~55 it/s
Yes, this doc discusses the performance of torch.compile on HF diffusers in detail:
https://huggingface.co/docs/diffusers/optimization/torch2.0
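For concreteness, here is a minimal sketch of the pattern that doc describes (and that the "NHWC UNet" rows in the tables above refer to): put the UNet in channels-last memory format and wrap it with torch.compile. The model ID is an assumption; see the linked doc for the authoritative version.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# NHWC (channels-last) layout tends to be faster for conv-heavy UNets.
pipe.unet.to(memory_format=torch.channels_last)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# The first call triggers compilation and is slow; benchmark later calls.
image = pipe("a photo of a cat", num_inference_steps=30).images[0]
```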
Data-center GPUs like the A100 are expensive and hard to get in my country, and we also have strict international Internet connection limitations here.
So testing cutting-edge ML models is not an easy task for me; please be patient.
Thanks a lot for the benchmarks! I wonder if the tile sizes for `torch.compile` are just very untuned for consumer hardware 🤔
If you happen to have some more time, could you try:
- making sure you turn on `mode="reduce-overhead"` for `torch.compile`;
- running with `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1` (a minimal sketch follows below).

I don't happen to have any consumer cards immediately available, so it's good to see `torch.compile`
performance on consumer hardware.
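For anyone trying this, here is a minimal sketch combining the two suggestions. The model ID is an assumption, and the environment variable has to be set before Inductor compiles anything, so set it before importing torch (or export it in the shell instead).

```python
import os

# Enable Inductor's coordinate-descent tuning of kernel tile sizes.
# Must be set before torch is imported / anything is compiled.
os.environ["TORCHINDUCTOR_COORDINATE_DESCENT_TUNING"] = "1"

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# mode="reduce-overhead" uses CUDA graphs to cut per-step launch overhead.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead")
```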
In my own development environment, with `mode="reduce-overhead"`, the model just generates buggy outputs...
Related Issues (20)
- Inference Error with Dynamic Resolution HOT 1
- Accelerate Stable Diffusion with LoRA and ControlNet? HOT 2
- FP8 support in stable fast HOT 6
- Installation failed with an error HOT 2
- How can I load the whole model from the compiled one instead of loading only the UNet into the current SD model?
- Too much recompilation after modifying the cross-attention processor HOT 1
- Can stable-fast be used with diffusers device_map='auto'? HOT 1
- v1.0.4 release contains the prebuilt binary for 1.0.3
- Triton group normalization is slower than torch ops in some cases HOT 1
- xformers 25 removes Triton fmha and completely breaks stable-fast, with fix (MemoryEfficientAttentionTritonFwdFlashBwOp, TritonFlashAttentionOp)
- Switching ControlNets HOT 1
- Changing the working device causes problems HOT 3
- [Bug] MemoryEfficientAttentionTritonFwdFlashBwOp no longer available HOT 1
- Enabling enable_cuda_graph only supports two different image sizes at inference HOT 1
- Support IP-Adapters
- enable_cuda_graph affects LoRA switching HOT 1
- Could anyone share a working demo of switching LoRA with config.enable_cuda_graph = True? HOT 7
- Install fatal error C1083: Windows 11 HOT 6
- Stable-fast compatibility with lightning models HOT 1
- gcc "-std=99" may need to be declared in code