Comments (8)
So it's about on par with TensorRT but a bit slower than OneFlow and this repo?
Since I have not tested either TensorRT or OneFlow on the same hardware, it's hard to compare the it/s numbers reported here against mine directly. But as a relative performance point, torch.compile on an A100 40G is about 8% slower than stable-fast, which puts it in the same ballpark as OneFlow, which is about 9% slower than stable-fast.
I ran careful benchmarks on the RTX 4090 and RTX 3090 today.
Performance varies greatly across different hardware/software/platform/driver configurations, so benchmarking accurately is hard, and preparing the environment for benchmarking is also a significant job. I have tested on some platforms before, but those results may still be inaccurate.
The A100 is currently hard and expensive to rent from cloud providers in my region; benchmark results will be posted when I have access to an A100 again.
RTX 4090 (512x512, batch size 1, fp16, tcmalloc enabled)

| Framework | SD 1.5 | SD 2.1 | SD 1.5 ControlNet |
|---|---|---|---|
| Vanilla PyTorch (2.1.0+cu118) | 24.9 it/s | 27.1 it/s | 18.9 it/s |
| torch.compile (2.1.0+cu118, NHWC UNet) | 33.5 it/s | 38.2 it/s | 22.7 it/s |
| AITemplate | 65.7 it/s | 71.6 it/s | untested |
| OneFlow | 60.1 it/s | 12.9 it/s (??) | untested |
| TensorRT | untested | untested | untested |
| Stable Fast (with xformers & triton) | 61.8 it/s | 61.6 it/s | 42.3 it/s |
RTX 3090 (512x512, batch size 1, fp16, tcmalloc enabled)

| Framework | SD 1.5 |
|---|---|
| Vanilla PyTorch (2.1.0+cu118) | 22.5 it/s |
| torch.compile (2.1.0+cu118, NHWC UNet) | 25.3 it/s |
| AITemplate | 34.6 it/s |
| OneFlow | 38.8 it/s |
| TensorRT | untested |
| Stable Fast (with xformers & triton) | 31.5 it/s |
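For reference, here is a minimal sketch of how an it/s figure like those in the tables can be measured with diffusers. The model ID, prompt, step count, and warmup policy are my assumptions, not the exact harness used for the numbers above.

```python
import time

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

steps = 30
prompt = "a photo of an astronaut riding a horse"

# Warmup run so one-time CUDA/allocation costs don't skew the timing.
pipe(prompt, num_inference_steps=steps, height=512, width=512)

torch.cuda.synchronize()
start = time.perf_counter()
pipe(prompt, num_inference_steps=steps, height=512, width=512)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"{steps / elapsed:.1f} it/s")
```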
I have made an attempt at reproducing the results, but always take these with a grain of salt, since there may be differences in CUDA/library versions and hardware (especially the memory bandwidth difference between the 40G and 80G A100 variants), etc.
On one of our A100 40G cards, using model=SD1.5, batch_size=1, steps=100, I get:
-> SD1.5 out of the box with torch 2.1 SDPA: ~32 it/s
-> SD1.5 + torch.compile: ~51 it/s
-> SD1.5 + stable-fast: ~55 it/s
Yes, this doc discusses the performance of torch.compile on HF diffusers in detail:
https://huggingface.co/docs/diffusers/optimization/torch2.0
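For concreteness, here is a minimal sketch of the pattern that doc describes (and that the "NHWC UNet" rows in the tables above refer to): put the UNet in channels-last memory format and wrap it with torch.compile. The model ID is an assumption; see the linked doc for the authoritative version.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# NHWC (channels-last) layout tends to be faster for conv-heavy UNets.
pipe.unet.to(memory_format=torch.channels_last)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# The first call triggers compilation and is slow; benchmark later calls.
image = pipe("a photo of a cat", num_inference_steps=30).images[0]
```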
Data-center GPUs like the A100 are expensive and hard to get in my country, and we also have strict international Internet connection limitations here.
So testing cutting-edge ML models is not an easy task for me; please be patient.
Thanks a lot for the benchmarks! I wonder if the tile sizes for `torch.compile` are just very untuned for consumer hardware 🤔
If you happen to have some more time, could you try:
- making sure you turn on `mode="reduce-overhead"` for `torch.compile`;
- running with `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1` (a minimal sketch follows below).

I don't happen to have any consumer cards immediately available, so it's good to see `torch.compile`
performance on consumer hardware.
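For anyone trying this, here is a minimal sketch combining the two suggestions. The model ID is an assumption, and the environment variable has to be set before Inductor compiles anything, so set it before importing torch (or export it in the shell instead).

```python
import os

# Enable Inductor's coordinate-descent tuning of kernel tile sizes.
# Must be set before torch is imported / anything is compiled.
os.environ["TORCHINDUCTOR_COORDINATE_DESCENT_TUNING"] = "1"

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# mode="reduce-overhead" uses CUDA graphs to cut per-step launch overhead.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead")
```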
In my own development environment, with `mode="reduce-overhead"`, the model just generates buggy outputs...
Related Issues (20)
- Inference Error with Dynamic Resolution HOT 1
- Accelerate Stable Diffusion with LoRA and ControlNet? HOT 2
- FP8 support in stable fast HOT 6
- Installation failed with an error HOT 2
- How can I load the whole model from the compiled one instead of loading only the UNet into the current SD model?
- Too much recompilation after modifying the cross-attention processor HOT 1
- Can stable-fast be used with diffusers device_map='auto'? HOT 1
- v1.0.4 release contains the prebuilt binary for 1.0.3
- Triton group normalization is slower than torch ops in some cases HOT 1
- xformers 25 removes Triton fmha and completely breaks stable-fast, with fix (MemoryEfficientAttentionTritonFwdFlashBwOp, TritonFlashAttentionOp)
- Switching ControlNets HOT 1
- Changing the working device causes problems HOT 3
- [Bug] MemoryEfficientAttentionTritonFwdFlashBwOp no longer available HOT 1
- Enabling enable_cuda_graph only supports two different image sizes at inference HOT 1
- Support IP-Adapters
- enable_cuda_graph affects LoRA switching HOT 1
- Could anyone share a working demo of switching LoRA with config.enable_cuda_graph = True? HOT 7
- Install fatal error C1083: Windows 11 HOT 6
- Stable-fast compatibility with lightning models HOT 1
- gcc "-std=99" may need to be declared in code