Comments (10)
Variations in execution time based on batch size:
```
BATCH_SIZE={1,2,3,4,5}
swift run StableDiffusionSample "a photo of an astronaut riding a horse on mars" \
  --resource-path ../mlpackages/Resources --seed 93 --output-path ../outputs \
  --compute-units cpuAndGPU --disable-safety --image-count=BATCH_SIZE
```
Batch Size | Loading Resources | Sampling | Inference (50 steps) | Total |
---|---|---|---|---|
1 | 4 sec | 17 sec | 16 sec | 37 sec |
2 | 4 sec | 20 sec | 32 sec | 56 sec |
3 | 4 sec | 20 sec | 49 sec | 73 sec |
4 | 4 sec | 20 sec | 66 sec | 90 sec |
5 | 4 sec | 21 sec | 81 sec | 106 sec |
Measured manually with the iPhone Timer app, so results may deviate from actual values by ~2 seconds.
from ml-stable-diffusion.
I guess the benchmarks aren't entirely wrong. The throughput for batched images is about 16 seconds per image, probably lower than Apple's 18 seconds because I disabled the NSFW filtering model.
However, Apple should warn users about the ~20-second static overhead. That matters for people making one-off images, where the ~40-second feedback loop is the bottleneck, not absolute batched throughput.
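To make the overhead-vs-throughput point concrete, the timings above fit a simple linear model. A quick sketch; the ~21 s overhead and ~17 s per image are eyeballed from the table, not separately measured:

```swift
// Rough linear fit of the cpuAndGPU timings above:
//   total ≈ fixed overhead + per-image cost * batch size
// The ~21 s overhead (load + sampling) and ~17 s per image are eyeballed from the
// table, so treat this as an approximation rather than measured data.
let overheadSeconds = 21.0
let secondsPerImage = 17.0

for batchSize in 1...5 {
    let predicted = overheadSeconds + secondsPerImage * Double(batchSize)
    print("batch \(batchSize): ~\(Int(predicted)) s predicted")
}
// Predicts ~38, 55, 72, 89, 106 s vs. the measured 37, 56, 73, 90, 106 s.
```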
from ml-stable-diffusion.
Curious what your setting is for the compute units. Try setting it to `.all`:

```swift
@Option(help: "Compute units to load model with {all,cpuOnly,cpuAndGPU,cpuAndNeuralEngine}") var computeUnits: ComputeUnits = .all
```

I hadn't noticed the large sampling time at every startup until I changed it to `cpuAndGPU`, and then I can reproduce your findings (referring to the CLI). With it set to `.all`, it starts up rather quickly, just a second or two.
The behavior might be different with different settings, depending on the device and its memory capabilities.
I would imagine, if you were building an application, that you could account for the setup time. The Maple/Native Diffusion implementation has a similar initial startup penalty. I haven't fully tested the recommended ANE settings with this on devices yet, but maybe it can be faster on-device with that setup? My guess is that with such large models there is a cost to loading all the weights onto the GPU.
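For what it's worth, if you were embedding this in an app rather than the CLI, the compute units come from the `MLModelConfiguration` passed when creating the pipeline, so the load cost can be paid once at startup. A minimal sketch, assuming the `StableDiffusionPipeline(resourcesAt:configuration:disableSafety:)` initializer shape from the version I looked at (it may differ in yours):

```swift
import CoreML
import Foundation
import StableDiffusion

// Choose compute units up front; .all lets Core ML place work on CPU, GPU and ANE.
let config = MLModelConfiguration()
config.computeUnits = .all   // or .cpuAndGPU / .cpuAndNeuralEngine

do {
    // Load once at app startup so the one-time setup cost doesn't land on the first image.
    // The initializer parameters below are from one version of the package and may differ in yours.
    let pipeline = try StableDiffusionPipeline(
        resourcesAt: URL(fileURLWithPath: "../mlpackages/Resources"),
        configuration: config,
        disableSafety: true
    )
    try pipeline.loadResources()
} catch {
    print("Pipeline setup failed: \(error)")
}
```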
from ml-stable-diffusion.
> referring to the CLI. With it set to `.all`, it starts up rather quickly, just a second or two.

It worked! I had compiled the attention implementation to be GPU-friendly (`ORIGINAL`), although I did see ANECompilerService compiling something for the neural engine. Perhaps the sampling pass ran on the ANE and the inference pass ran on the GPU (at ~70% utilization).
Latencies: 4 sec (loading), 1 sec (sampling), 19 sec (inference). I'll switch back to v1.5 and provide an updated table of latencies, along with performance when optimizing attention for the ANE. Meanwhile, here are the power consumption metrics around the sampling stage with `.all`:
Stage | Timestamp (s) | CPU (mW) | GPU (mW) | ANE (mW) |
---|---|---|---|---|
Load | -0.3 | 2595 | 35 | 0 |
Load | -0.2 | 2815 | 0 | 0 |
Load | -0.1 | 4279 | 18 | 0 |
Sample | 0.0 | 3685 | 53 | 0 |
Sample | 0.1 | 2923 | 9 | 0 |
Sample | 0.2 | 2397 | 9 | 0 |
Sample | 0.3 | 2447 | 9 | 0 |
Sample | 0.4 | 2569 | 9 | 0 |
Sample | 0.5 | 3383 | 1563 | 0 |
Sample | 0.6 | 3611 | 88 | 283 |
Sample | 0.7 | 2622 | 5227 | 441 |
Sample | 0.8 | 1818 | 5393 | 3195 |
Sample | 0.9 | 1717 | 1903 | 3859 |
Sample | 1.0 | 2417 | 5144 | 759 |
Inference | 1.1 | 2531 | 14464 | 573 |
Inference | 1.2 | 440 | 11207 | 1549 |
Inference | 1.3 | 217 | 1224 | 4255 |
Inference | 1.4 | 508 | 13588 | 2359 |
Inference | 1.5 | 1439 | 18315 | 1324 |
Sampling finishes too quickly to prove conclusively whether it's actually utilizing the ANE, or whether the tool is just late to report that inference has started.
And here are the metrics with `.cpuAndGPU` (~36 watts during inference):
Stage | Timestamp (s) | CPU (mW) | GPU (mW) | ANE (mW) |
---|---|---|---|---|
Sample | -0.5 | 1428 | 9 | 0 |
Sample | -0.4 | 1416 | 18 | 0 |
Sample | -0.3 | 2390 | 26 | 0 |
Sample | -0.2 | 1650 | 14378 | 0 |
Sample | -0.1 | 1982 | 40245 | 0 |
Inference | 0.0 | 1156 | 35068 | 0 |
Inference | 0.1 | 1105 | 33366 | 0 |
Inference | 0.2 | 839 | 40909 | 0 |
Inference | 0.3 | 1343 | 27957 | 0 |
Inference | 0.4 | 503 | 38959 | 0 |
from ml-stable-diffusion.
Note that if you try to re-run the command for generating a CoreML model, it will actually silently fail. You have to purge the `mlpackages` directory first. I did not know this when switching between `SPLIT_EINSUM` and `ORIGINAL` previously.
```
BATCH_SIZE={1,2,3,4,5}
swift run StableDiffusionSample "a photo of an astronaut riding a horse on mars" \
  --resource-path ../mlpackages/Resources --seed 93 --output-path ../outputs \
  --compute-units all --disable-safety --image-count=BATCH_SIZE
```
With attention set to `ORIGINAL` (~15 watts during inference):
Batch Size | Loading Resources (sec) | Sampling (sec) | Inference (50 steps, sec) | Total (sec) |
---|---|---|---|---|
1 | 3 | 1 | 19 | 24 |
2 | 3 | 2 | 39 | 44 |
3 | 4 | 2 | 60 | 65 |
4 | 3 | 3 | 79 | 85 |
5 | 3 | 3 | - | - |
10 | 3 | 6 | - | - |
20 | 3 | 9 | - | - |
40 | 3 | 17 | - | - |
This seems to have marginally slower batched throughput (20 sec/image vs 16 sec/image), but about half the power consumption (15 W vs 36 W). Overall, it seems better than `.cpuAndGPU`. The GPU:ANE performance ratio stays the same on the M1 Ultra, so these should be the best settings on all Apple silicon Macs.
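To put "about half the power consumption" in per-image terms, a rough calculation using only the throughput and wattage figures above:

```swift
// Approximate energy per generated image during the inference stage only,
// using the wattage and per-image timings quoted above.
let gpuOnly  = (watts: 36.0, secondsPerImage: 16.0)   // .cpuAndGPU, ORIGINAL
let allUnits = (watts: 15.0, secondsPerImage: 20.0)   // .all, ORIGINAL

let gpuJoules = gpuOnly.watts * gpuOnly.secondsPerImage     // ≈ 576 J/image
let allJoules = allUnits.watts * allUnits.secondsPerImage   // ≈ 300 J/image
print("Inference energy per image: \(gpuJoules) J vs \(allJoules) J")
// .all is ~25% slower per image but uses roughly half the energy per image.
```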
With attention set to `SPLIT_EINSUM` (~13 watts during inference):
Batch Size | Loading Resources (sec) | Sampling (sec) | Inference (50 steps, sec) | Total (sec) |
---|---|---|---|---|
1 | 5 | 1 | 22 | 29 |
2 | 5 | 2 | 45 | 52 |
3 | 5 | 2 | 68 | 75 |
With attention set to `SPLIT_EINSUM` and only `.cpuAndNeuralEngine` (~3 watts during inference):
Batch Size | Loading Resources (sec) | Sampling (sec) | Inference (50 steps, sec) | Total (sec) |
---|---|---|---|---|
1 | 4 | 1 | 39 | 44 |
2 | 4 | 2 | 77 | 83 |
3 | 4 | 3 | 116 | 122 |
from ml-stable-diffusion.
I've predicted the likely fastest configuration on each M1 model, and adjusted the numbers to match actual CLI latencies.
Device | `--compute-unit` | `--attention-implementation` | Latency (seconds) |
---|---|---|---|
Mac Studio (M1 Ultra, 64-core GPU) | `ALL` | `ORIGINAL` | 9 -> 14 |
Mac Studio (M1 Ultra, 48-core GPU) | `ALL` | `ORIGINAL` | 13 -> 18 |
MacBook Pro (M1 Max, 32-core GPU) | `ALL` | `ORIGINAL` | 18 -> 24 |
MacBook Pro (M1 Max, 24-core GPU) | `ALL` | `ORIGINAL` | 20 -> 26 |
MacBook Pro (M1 Pro, 16-core GPU) | `ALL` | `SPLIT_EINSUM` | 26 -> 30 |
MacBook Pro (M1) | `CPU_AND_NE` | `SPLIT_EINSUM` | 35 -> 39 |
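If you wanted to bake that guess into an app, it could look something like the sketch below; the GPU-core-count thresholds, the `suggestedSettings` helper, and the `AttentionVariant` enum are purely my own illustration of the table, not anything the package provides:

```swift
import CoreML

// Hypothetical helper encoding the table above; thresholds are my own guesses.
enum AttentionVariant { case original, splitEinsum }

func suggestedSettings(gpuCoreCount: Int) -> (MLComputeUnits, AttentionVariant) {
    switch gpuCoreCount {
    case 24...: return (.all, .original)                     // M1 Max / M1 Ultra class
    case 14...: return (.all, .splitEinsum)                  // M1 Pro class
    default:    return (.cpuAndNeuralEngine, .splitEinsum)   // base M1
    }
}
```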
Regarding battery life on the M1 Max, there's a tradeoff between latency and power efficiency; you may want to use the neural engine when on battery. I assumed 3 W during load and sampling, except 1.5 W for sampling with `.cpuAndGPU`.
Compute Units | Attention | Runtime (sec) | Energy (J) | Inferences/Charge | Battery Life |
---|---|---|---|---|---|
`.cpuAndGPU` | `ORIGINAL` | 37 | 614 | ~420 | 4 hours |
`.all` | `ORIGINAL` | 24 | 297 | ~870 | 6 hours |
`.all` | `SPLIT_EINSUM` | 29 | 304 | ~850 | 7 hours |
`.cpuAndNeuralEngine` | `SPLIT_EINSUM` | 44 | 132 | ~1960 | 24 hours |
Assuming a 100 watt-hour battery at 90% health (324,000 joules), drained from 90% down to 10% charge, a typical real-world scenario.
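For transparency, here is the arithmetic behind the Inferences/Charge and Battery Life columns, as a small sketch using the same assumptions:

```swift
// Reproduce the Inferences/Charge and Battery Life columns from the table above.
let batteryJoules = 100.0 * 3600.0 * 0.9   // 100 Wh at 90% health = 324,000 J
let usableJoules  = batteryJoules * 0.8    // drained from 90% down to 10% charge

// (label, runtime in seconds, energy per image in joules), straight from the table
let configs = [
    (".cpuAndGPU / ORIGINAL", 37.0, 614.0),
    (".all / ORIGINAL", 24.0, 297.0),
    (".all / SPLIT_EINSUM", 29.0, 304.0),
    (".cpuAndNeuralEngine / SPLIT_EINSUM", 44.0, 132.0),
]

for (label, runtime, joulesPerImage) in configs {
    let images = usableJoules / joulesPerImage
    let hours  = images * runtime / 3600.0
    print("\(label): ~\(Int(images)) images per charge, ~\(Int(hours.rounded())) hours")
}
// Yields roughly 422 / 4 h, 872 / 6 h, 852 / 7 h, 1963 / 24 h, matching the table.
```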
from ml-stable-diffusion.
The benchmark is still misleading. They said they could generate an image on the M1 Ultra 48-core GPU within 13 seconds, and they didn't even use the Swift package or the neural engine!

> The executed program is `python_coreml_stable_diffusion.pipeline` for macOS devices and a minimal Swift test app built on the `StableDiffusion` Swift package for iOS and iPadOS devices.
from ml-stable-diffusion.
> Stage | Timestamp (s) | CPU (mW) | GPU (mW) | ANE (mW)

How do you obtain these detailed mW readings while the process is running?
from ml-stable-diffusion.
```
sudo powermetrics --sample-rate 100
```
from ml-stable-diffusion.