
Comments (8)

chengzeyi avatar chengzeyi commented on May 18, 2024 4

> So it's about on par with TensorRT but a bit slower than OneFlow and this repo?

> Since I have not tested either TensorRT or OneFlow on the same hardware, it's hard to compare the reported it/s numbers here directly against mine. But as a relative perf point, torch.compile on an A100 40G is about 8% slower than stable-fast, which puts it in the same ballpark as OneFlow, which is 9% slower than stable-fast.

I ran an accurate benchmark on the 4090 and 3090 myself today.

Performance varies widely across different hardware/software/platform/driver configurations, which makes accurate benchmarking very hard, and preparing the environment for benchmarking is also a lot of work. I have tested on some platforms before, but those results may still be inaccurate.

Currently the A100 is hard and expensive to rent from cloud providers in my region, so A100 results will be available when I have access to one again.

RTX 4090 (512x512, batch size 1, fp16, tcmalloc enabled)

| Framework | SD 1.5 | SD 2.1 | SD 1.5 ControlNet |
| --- | --- | --- | --- |
| Vanilla PyTorch (2.1.0+cu118) | 24.9 it/s | 27.1 it/s | 18.9 it/s |
| torch.compile (2.1.0+cu118, NHWC UNet) | 33.5 it/s | 38.2 it/s | 22.7 it/s |
| AITemplate | 65.7 it/s | 71.6 it/s | untested |
| OneFlow | 60.1 it/s | 12.9 it/s (??) | untested |
| TensorRT | untested | untested | untested |
| Stable Fast (with xformers & triton) | 61.8 it/s | 61.6 it/s | 42.3 it/s |

RTX 3090 (512x512, batch size 1, fp16, tcmalloc enabled)

| Framework | SD 1.5 |
| --- | --- |
| Vanilla PyTorch (2.1.0+cu118) | 22.5 it/s |
| torch.compile (2.1.0+cu118, NHWC UNet) | 25.3 it/s |
| AITemplate | 34.6 it/s |
| OneFlow | 38.8 it/s |
| TensorRT | untested |
| Stable Fast (with xformers & triton) | 31.5 it/s |

from stable-fast.
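As an aside, the it/s figures above are easier to compare once converted to per-image denoising latency for a fixed step count; a minimal stdlib sketch (the 20-step count is an assumption for illustration, not from the thread):

```python
def latency_s(its_per_sec: float, steps: int = 20) -> float:
    """Approximate per-image denoising latency implied by an it/s figure."""
    return steps / its_per_sec

# 4090 numbers from the table above: stable-fast vs. vanilla PyTorch on SD 1.5
fast = latency_s(61.8)  # ~0.32 s for 20 steps
slow = latency_s(24.9)  # ~0.80 s for 20 steps
print(f"speedup: {slow / fast:.2f}x")  # speedup: 2.48x
```

Note the speedup ratio is independent of the assumed step count, since it cancels out.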

isidentical avatar isidentical commented on May 18, 2024

I have attempted to reproduce the results, but take these with a grain of salt since there may be differences in CUDA/library versions and hardware (especially the memory bandwidth difference between the 40G and 80G variants), etc.

On one of our A100 40G, using model=SD1.5, batch_size=1, steps=100, I get:

-> SD1.5 out of the box with torch 2.1 SDPA is ~32it/s
-> SD1.5 + torch.compile is ~51it/s
-> SD1.5 + stable-fast is ~55it/s
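The relative-perf framing used later in the thread (torch.compile about 8% slower than stable-fast) roughly follows from these numbers; a quick stdlib check (it comes out near 7%, so "about 8%" is a fair rounding):

```python
torch_compile = 51.0  # it/s on the A100 40G, from above
stable_fast = 55.0    # it/s on the same card, from above

slowdown = 1 - torch_compile / stable_fast
print(f"torch.compile is {slowdown:.1%} slower than stable-fast")
# torch.compile is 7.3% slower than stable-fast
```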


chengzeyi avatar chengzeyi commented on May 18, 2024

> I have attempted to reproduce the results, but take these with a grain of salt since there may be differences in CUDA/library versions and hardware (especially the memory bandwidth difference between the 40G and 80G variants), etc.
>
> On one of our A100 40G, using model=SD1.5, batch_size=1, steps=100, I get:
>
> -> SD1.5 out of the box with torch 2.1 SDPA is ~32it/s
> -> SD1.5 + torch.compile is ~51it/s
> -> SD1.5 + stable-fast is ~55it/s

Yes. The performance of torch.compile on HF Diffusers is discussed in detail in this doc:

https://huggingface.co/docs/diffusers/optimization/torch2.0


Chillee avatar Chillee commented on May 18, 2024

So it's about on par with TensorRT but a bit slower than OneFlow and this repo?


isidentical avatar isidentical commented on May 18, 2024

> So it's about on par with TensorRT but a bit slower than OneFlow and this repo?

Since I have not tested either TensorRT or OneFlow on the same hardware, it's hard to compare the reported it/s numbers here directly against mine. But as a relative perf point, torch.compile on an A100 40G is about 8% slower than stable-fast, which puts it in the same ballpark as OneFlow, which is 9% slower than stable-fast.


chengzeyi avatar chengzeyi commented on May 18, 2024

> So it's about on par with TensorRT but a bit slower than OneFlow and this repo?
>
> Since I have not tested either TensorRT or OneFlow on the same hardware, it's hard to compare the reported it/s numbers here directly against mine. But as a relative perf point, torch.compile on an A100 40G is about 8% slower than stable-fast, which puts it in the same ballpark as OneFlow, which is 9% slower than stable-fast.

Commercial GPUs are expensive and hard to get in my country, and we also have strict international Internet connection limits here.

So testing cutting-edge ML models is not an easy task for me; please be patient.


Chillee avatar Chillee commented on May 18, 2024

Thanks for the benchmarks!

I wonder if the tile sizes for torch.compile are just very untuned for consumer hardware 🤔

If you happen to have some more time, could you try:

  1. Make sure mode="reduce-overhead" is turned on for torch.compile.
  2. Try running with TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1.

Thanks a lot for the benchmarks! I don't happen to have any consumer cards immediately available, so it's good to see torch.compile performance on consumer hardware.
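For reference, the two tweaks could be wired up roughly like this (a sketch only: `pipe` stands for an already-loaded diffusers pipeline, `fullgraph=True` is an extra assumption taken from the HF docs rather than part of the suggestion above, and the torch.compile call is left commented since it needs a GPU setup):

```python
import os

# 1) Coordinate-descent tuning is read from the environment, so it must be
#    set before torch.compile traces the model.
os.environ["TORCHINDUCTOR_COORDINATE_DESCENT_TUNING"] = "1"

# 2) "reduce-overhead" enables CUDA-graph-based overhead reduction.
compile_kwargs = {"mode": "reduce-overhead", "fullgraph": True}
# pipe.unet = torch.compile(pipe.unet, **compile_kwargs)
```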


chengzeyi avatar chengzeyi commented on May 18, 2024

> Thanks for the benchmarks!
>
> I wonder if the tile sizes for torch.compile are just very untuned for consumer hardware 🤔
>
> If you happen to have some more time, could you try:
>
> 1. Make sure mode="reduce-overhead" is turned on for torch.compile.
> 2. Try running with TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1.
>
> Thanks a lot for the benchmarks! I don't happen to have any consumer cards immediately available, so it's good to see torch.compile performance on consumer hardware.

In my own development environment, with 'reduce-overhead', the model just generates buggy outputs...

