
mlx-benchmark

This repo aims to benchmark Apple's MLX operations and layers, on all Apple Silicon chips, along with some GPUs.

Contribute: If you have a device not yet featured in the benchmark, especially the ones listed below, your PR is welcome to broaden the scope and accuracy of this project.

Current devices: M1, M1 Pro, M2, M2 Pro, M2 Max, M2 Ultra, M3 Pro, M3 Max.

Missing devices: M1 Max, M1 Ultra, M3, M3 Ultra, and other CUDA GPUs.

Benchmarks

Benchmarks are generated by measuring the runtime of every MLX operation, along with its PyTorch equivalent on the mps, cpu, and cuda backends. For each operation, we measure the runtime over multiple experiments, and we propose two benchmarks based on these experiments.
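The measurement loop can be sketched as follows. This is a minimal illustration, not the repo's actual harness; the `time_op` helper and the workload are hypothetical:

```python
import time

def time_op(fn, n_iters=100):
    """Average wall-clock runtime of fn, in milliseconds, over n_iters runs."""
    fn()  # warmup run, excluded from the measurement
    start = time.perf_counter()
    for _ in range(n_iters):
        fn()
    return (time.perf_counter() - start) / n_iters * 1000.0

# A framework op would go here in practice. Note that MLX is lazy, so an MLX
# op must be forced with mx.eval(), and a torch op on CUDA must be followed
# by torch.cuda.synchronize() before the clock is read.
ms = time_op(lambda: sum(x * x for x in range(10_000)))
```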

Installation

Installation on Mac devices

Running the benchmark locally is straightforward. Create a new env with osx-arm64 architecture and install the dependencies.

CONDA_SUBDIR=osx-arm64 conda create -n mlx_benchmark python=3.10 numpy pytorch torchvision scipy requests -c conda-forge

pip install -r requirements.txt

Installation on other devices

Operating systems other than macOS can only run the torch experiments, on CPU or with a CUDA device. Create a new env without the CONDA_SUBDIR=osx-arm64 prefix and install the torch build that matches your CUDA version. Then install everything in requirements.txt except mlx.

Finally, open the config.py file and set:

USE_MLX = False

to avoid importing the mlx package, which cannot be installed on non-Mac devices.
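A minimal sketch of how such a flag can gate the import (hypothetical; the repo's actual config.py may differ):

```python
USE_MLX = False  # set to False in config.py on non-Mac devices

if USE_MLX:
    # mlx only installs on Apple Silicon, so the import must be guarded
    import mlx.core as mx
```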

Run the benchmark

Run on Mac

To run the benchmark on mps, mlx and CPU:

python run_benchmark.py --include_mps=True --include_mlx=True --include_cpu=True

Run on other devices

To run the torch benchmark on CUDA and CPU:

python run_benchmark.py --include_mps=False --include_mlx=False --include_cuda=True --include_cpu=True

Contributing

Everyone can contribute to the benchmark! If you have a missing device or if you want to add a missing layer/operation, please read the contribution guidelines.

mlx-benchmark's People

Contributors

alexziskind1, arnabkumarroy02, dasayan05, ivanfioravanti, menzhse, tristanbilot


mlx-benchmark's Issues

Question: Interpretation of cuda results

Tristan, I'm going to officially do a PR, but wanted to run this preview by you and get your take on how to interpret the results. This is from a machine with Core i9 and RTX4090. Does it look right? What do these results mean (specifically in the cuda column)? Thanks!

Average benchmark:

Operation cpu cuda
Argmax 9.36 0.06
BCE 26.26 25.90
Concat 30.27 0.06
Conv1d 61.74 0.17
Conv2d 26.07 0.09
LeakyReLU 6.11 0.04
Linear 103.52 0.07
MatMul 81.08 0.05
PReLU 4.86 0.05
ReLU 10.29 0.06
SeLU 7.34 0.06
Sigmoid 7.88 0.04
Softmax 21.22 0.04
Softplus 7.25 0.04
Sort 61.68 0.11
Sum 13.22 0.06
SumAll 8.50 0.06
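One common explanation for near-zero CUDA timings (a general PyTorch caveat, not a claim about this repo's harness): torch launches CUDA kernels asynchronously, so a naive timer can stop before the GPU has actually finished the work. A hedged sketch of a synchronized measurement, using a hypothetical `time_cuda_op` helper:

```python
import time
import torch

def time_cuda_op(fn, n_iters=100):
    """Average runtime of fn in ms, waiting for pending CUDA work to finish."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain queued kernels before starting the clock
    start = time.perf_counter()
    for _ in range(n_iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait until all launched kernels complete
    return (time.perf_counter() - start) / n_iters * 1000.0

# Without the final synchronize, only the (very cheap) kernel launches are
# timed, which can produce sub-millisecond figures like those in the table.
```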

Benchmark logic changed, current one can't give negative results

@TristanBilot in commit ed51428 I see that the logic for mps/mlx changed, i.e.: v[h] = (v["mps"] / v["mlx_gpu"] - 1) + 1
This code can't generate negative values, but in the benchmark you posted for the M1 Pro they are visible.
It seems that the last + 1 was not taken into account.

I will soon post a PR with benchmark results for the M3 Max using the current code, but I think you need to run the M1 Pro benchmark again.
No?
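For reference, the expression above simplifies algebraically: (mps/mlx_gpu - 1) + 1 is just mps/mlx_gpu, a ratio of positive runtimes, so it can never be negative; the version without the trailing + 1 can. A worked example with hypothetical runtimes:

```python
mps, mlx_gpu = 2.0, 4.0                  # hypothetical runtimes in ms (mps faster here)

without_plus_one = mps / mlx_gpu - 1     # -0.5: negative whenever mps is faster
with_plus_one = (mps / mlx_gpu - 1) + 1  #  0.5: reduces to the plain ratio

assert with_plus_one == mps / mlx_gpu    # always positive for positive runtimes
```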

Minimum machine requirements for the benchmark

Hi Tristan. I'm running this on a few machines, and the 8GB machines can't deal with it. Not that anyone will really try to use an 8GB machine for MLX, but wondering if you know of the minimum RAM requirements. My other machines are handling it just fine.
