
Comments (10)

philipturner commented on July 30, 2024

Variations in execution time based on batch size:

BATCH_SIZE={1,2,3,4,5}
swift run StableDiffusionSample "a photo of an astronaut riding a horse on mars" \
--resource-path ../mlpackages/Resources --seed 93 --output-path ../outputs \
--compute-units cpuAndGPU --disable-safety --image-count=BATCH_SIZE
Batch Size | Loading Resources | Sampling | Inference (50 steps) | Total
1 | 4 sec | 17 sec | 16 sec | 37 sec
2 | 4 sec | 20 sec | 32 sec | 56 sec
3 | 4 sec | 20 sec | 49 sec | 73 sec
4 | 4 sec | 20 sec | 66 sec | 90 sec
5 | 4 sec | 21 sec | 81 sec | 106 sec

Measured manually with the iPhone Timer app, so results may deviate from actual values by ~2 seconds.
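
For what it's worth, the runs above can be scripted and timed from the shell rather than with a stopwatch; a minimal sketch using the same flags as the command above:

# Run the benchmark once per batch size and time each run with the shell's built-in `time`.
for BATCH_SIZE in 1 2 3 4 5; do
  time swift run StableDiffusionSample "a photo of an astronaut riding a horse on mars" \
    --resource-path ../mlpackages/Resources --seed 93 --output-path ../outputs \
    --compute-units cpuAndGPU --disable-safety --image-count=$BATCH_SIZE
done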


philipturner commented on July 30, 2024

I guess the benchmarks aren't entirely wrong. The throughput for batched images is 16 seconds/image, probably lower than Apple's 18 sec because I disabled the NSFW filtering model.

However, Apple should warn users about the ~20 second static overhead. This would be important for people making one-off images where the 40-second feedback loop is their bottleneck, not absolute batched throughput.
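
As a rough linear fit to the totals in the table above: Total ≈ 20 s of fixed overhead + ~17 s per image, so a one-off image costs roughly twice its marginal batched cost.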


littleowl commented on July 30, 2024

Curious what your setting is for the compute units. Try setting it to .all:
@Option(help: "Compute units to load model with {all,cpuOnly,cpuAndGPU,cpuAndNeuralEngine}") var computeUnits: ComputeUnits = .all
I hadn't noticed the long sampling time at every startup until I changed the setting to cpuAndGPU, at which point I could reproduce your findings (this is referring to the CLI). With it set to .all, it starts up rather quickly, in just a second or two.
The behavior might differ with other settings depending on the device and its memory capabilities.
If you were building an application, I would imagine you could account for the setup time; the Maple/Native Diffusion implementation has a similar initial startup penalty. I haven't fully tested the recommended ANE settings with this on devices yet, but maybe it can be faster on device with that setup? My guess is that with such large models there is a cost to loading all the weights onto the GPU.
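
For anyone setting this outside the CLI, the compute-unit choice is just Core ML's MLModelConfiguration.computeUnits. A minimal sketch; the StableDiffusionPipeline initializer below is illustrative only, and its exact signature depends on the package version:

import Foundation
import CoreML
import StableDiffusion

// Select which hardware Core ML may dispatch the model to.
// .all lets Core ML split work across the CPU, GPU, and Neural Engine.
let config = MLModelConfiguration()
config.computeUnits = .all   // alternatives: .cpuOnly, .cpuAndGPU, .cpuAndNeuralEngine

// Illustrative; check the StableDiffusion package for the exact initializer in your version.
let resources = URL(fileURLWithPath: "../mlpackages/Resources")
let pipeline = try StableDiffusionPipeline(resourcesAt: resources,
                                           configuration: config,
                                           disableSafety: true)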


philipturner commented on July 30, 2024

"With it set to .all, it starts up rather quickly, in just a second or two."

It worked! I had compiled the attention implementation to be GPU-friendly (ORIGINAL), although I did see ANECompilerService compiling something for the neural engine. Perhaps the original sampling pass occurred on the ANE, and the inference pass occurred on the GPU (with 70% utilization).

Latencies: 4 sec (load), 1 sec (sampling), 19 sec (inference). I'll switch back to v1.5 and provide an updated table of latencies, along with performance when optimizing attention for the ANE. Meanwhile, here are the power consumption metrics around the sampling stage with .all:

Stage | Timestamp (s) | CPU (mW) | GPU (mW) | ANE (mW)
Load | -0.3 | 2595 | 35 | 0
Load | -0.2 | 2815 | 0 | 0
Load | -0.1 | 4279 | 18 | 0
Sample | 0.0 | 3685 | 53 | 0
Sample | 0.1 | 2923 | 9 | 0
Sample | 0.2 | 2397 | 9 | 0
Sample | 0.3 | 2447 | 9 | 0
Sample | 0.4 | 2569 | 9 | 0
Sample | 0.5 | 3383 | 1563 | 0
Sample | 0.6 | 3611 | 88 | 283
Sample | 0.7 | 2622 | 5227 | 441
Sample | 0.8 | 1818 | 5393 | 3195
Sample | 0.9 | 1717 | 1903 | 3859
Sample | 1.0 | 2417 | 5144 | 759
Inference | 1.1 | 2531 | 14464 | 573
Inference | 1.2 | 440 | 11207 | 1549
Inference | 1.3 | 217 | 1224 | 4255
Inference | 1.4 | 508 | 13588 | 2359
Inference | 1.5 | 1439 | 18315 | 1324

Sampling is too quick to prove definitively whether it is actually utilizing the ANE, or whether the readings were just late to show that inference had started.

And here are the metrics with .cpuAndGPU (~36 watts during inference):

Stage | Timestamp (s) | CPU (mW) | GPU (mW) | ANE (mW)
Sample | -0.5 | 1428 | 9 | 0
Sample | -0.4 | 1416 | 18 | 0
Sample | -0.3 | 2390 | 26 | 0
Sample | -0.2 | 1650 | 14378 | 0
Sample | -0.1 | 1982 | 40245 | 0
Inference | 0.0 | 1156 | 35068 | 0
Inference | 0.1 | 1105 | 33366 | 0
Inference | 0.2 | 839 | 40909 | 0
Inference | 0.3 | 1343 | 27957 | 0
Inference | 0.4 | 503 | 38959 | 0


philipturner commented on July 30, 2024

Note that if you try to re-run the command for generating a CoreML model, it will silently fail. You have to purge the mlpackages directory first. I did not know this when switching between SPLIT_EINSUM and ORIGINAL earlier.
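
A sketch of the workaround, assuming the conversion entry point and flags from the repo README (flag names may differ between versions):

# Purge the previously generated models first; re-running the conversion over an
# existing mlpackages directory silently leaves the old files in place.
rm -rf ../mlpackages
# Then re-convert with the desired attention implementation and rebuild the
# Resources folder used by the Swift CLI.
python -m python_coreml_stable_diffusion.torch2coreml \
  --convert-unet --convert-text-encoder --convert-vae-decoder \
  --attention-implementation ORIGINAL \
  --bundle-resources-for-swift-cli -o ../mlpackages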

BATCH_SIZE={1,2,3,4,5}
swift run StableDiffusionSample "a photo of an astronaut riding a horse on mars" \
--resource-path ../mlpackages/Resources --seed 93 --output-path ../outputs \
--compute-units all --disable-safety --image-count=BATCH_SIZE

With attention set to ORIGINAL (~15 watts during inference):

Batch Size | Loading Resources (sec) | Sampling (sec) | Inference, 50 steps (sec) | Total (sec)
1 | 3 | 1 | 19 | 24
2 | 3 | 2 | 39 | 44
3 | 4 | 2 | 60 | 65
4 | 3 | 3 | 79 | 85
5 | 3 | 3 | - | -
10 | 3 | 6 | - | -
20 | 3 | 9 | - | -
40 | 3 | 17 | - | -

This seems to have marginally slower batched throughput (20 sec vs 16 sec), but about half the power consumption (15 W vs 36 W). Overall, it seems better than .cpuAndGPU. The GPU:ANE performance ratio stays the same on M1 Ultra, so these should be the best settings on all Apple silicon Macs.
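
Back-of-the-envelope energy per image: roughly 15 W × 20 s ≈ 300 J with .all versus 36 W × 16 s ≈ 580 J with .cpuAndGPU, so about half the energy per image despite the slightly longer runtime.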

With attention set to SPLIT_EINSUM (~13 watts during inference):

Batch Size | Loading Resources (sec) | Sampling (sec) | Inference, 50 steps (sec) | Total (sec)
1 | 5 | 1 | 22 | 29
2 | 5 | 2 | 45 | 52
3 | 5 | 2 | 68 | 75

With attention set to SPLIT_EINSUM and only .cpuAndNeuralEngine (~3 watts during inference):

Batch Size | Loading Resources (sec) | Sampling (sec) | Inference, 50 steps (sec) | Total (sec)
1 | 4 | 1 | 39 | 44
2 | 4 | 2 | 77 | 83
3 | 4 | 3 | 116 | 122


philipturner commented on July 30, 2024

I've predicted the configuration likely to be fastest in practice on each M1 model, and adjusted the published numbers to match end-to-end CLI latencies (the adjustment adds roughly the 4-6 seconds of load and sampling overhead measured above).

Device | --compute-unit | --attention-implementation | Latency (seconds)
Mac Studio (M1 Ultra, 64-core GPU) | ALL | ORIGINAL | 9 -> 14
Mac Studio (M1 Ultra, 48-core GPU) | ALL | ORIGINAL | 13 -> 18
MacBook Pro (M1 Max, 32-core GPU) | ALL | ORIGINAL | 18 -> 24
MacBook Pro (M1 Max, 24-core GPU) | ALL | ORIGINAL | 20 -> 26
MacBook Pro (M1 Pro, 16-core GPU) | ALL | SPLIT_EINSUM | 26 -> 30
MacBook Pro (M1) | CPU_AND_NE | SPLIT_EINSUM | 35 -> 39

Regarding battery life on M1 Max, there's a tradeoff between latency and power efficiency; you may want to use the neural engine when on battery. I assumed 3 W during load and sampling, except 1.5 W for sampling with .cpuAndGPU.

Compute Units | Attention | Runtime (s) | Energy (J) | Inferences/Charge | Battery Life
.cpuAndGPU | ORIGINAL | 37 | 614 | ~420 | 4 hours
.all | ORIGINAL | 24 | 297 | ~870 | 6 hours
.all | SPLIT_EINSUM | 29 | 304 | ~850 | 7 hours
.cpuAndNeuralEngine | SPLIT_EINSUM | 44 | 132 | ~1960 | 24 hours

This assumes a 100 watt-hour battery at 90% health, or 324,000 joules, drained from 90% to 10% charge, which is a typical real-world scenario.
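
As a worked check of the .all / ORIGINAL row: 100 Wh × 3,600 s/h = 360,000 J, × 0.9 health = 324,000 J, × 0.8 usable (90% down to 10%) = 259,200 J, and 259,200 J ÷ 297 J per image ≈ 870 inferences per charge.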


hirakujira commented on July 30, 2024

The benchmark is still misleading. They said they could generate an image within 13 seconds on an M1 Ultra with the 48-core GPU, and they didn't even use the Swift package or the neural engine!

The executed program is python_coreml_stable_diffusion.pipeline for macOS devices and a minimal Swift test app built on the StableDiffusion Swift package for iOS and iPadOS devices.


rovo79 commented on July 30, 2024

Stage | Timestamp (s) | CPU (mW) | GPU (mW) | ANE (mW)

How do you obtain these detailed mW readings while the process is running?


philipturner commented on July 30, 2024

sudo powermetrics --sample_rate 100


rovo79 commented on July 30, 2024

