
iree-comparative-benchmark's Introduction

OpenXLA Benchmark

This repository is the home of the common benchmarking infrastructure described in the accompanying RFC. It aims to be a compiler-agnostic benchmark suite that can be used both in standalone comparative benchmark workflows and in the regression benchmarking that lives in each compiler project.

There are two components in this repository: the common_benchmark_suite and the comparative_benchmark.

The common_benchmark_suite is standalone and should not depend on the comparative_benchmark.

Supported Runtimes

Framework Level

These benchmarks are run from the Deep Learning Framework and measure the end-to-end latency a user sees when running the workload from a framework such as PyTorch. Supported runtimes:

  • JAX with IREE PJRT.
  • JAX and Tensorflow with XLA.
  • PyTorch with Inductor.
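
As a rough illustration (a minimal sketch that times a jitted JAX call directly, not this suite's actual harness), a framework-level measurement captures the end-to-end latency of the call as the user makes it, after a warm-up run to exclude compilation:

import time
import jax
import jax.numpy as jnp

# Minimal framework-level timing sketch: measure the latency of a jitted call
# as seen from the framework, excluding the first (compile) invocation.
@jax.jit
def forward(x):
    return jnp.tanh(x @ x.T).sum()

x = jnp.ones((512, 512), dtype=jnp.float32)
forward(x).block_until_ready()  # warm-up / compilation

start = time.perf_counter()
for _ in range(10):
    forward(x).block_until_ready()
print(f"mean latency: {(time.perf_counter() - start) / 10 * 1e3:.2f} ms")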

Compiler/Library Level

These benchmarks do not include the Deep Learning Framework, making them more reflective of the final deployment environment or an AOT deployment. Supported runtimes:

  • JAX, Tensorflow, PyTorch and TFLite with IREE using MLIR input.
  • JAX, Tensorflow with XLA using HLO input.
  • TFLite.
  • GGML (experimental).

Supported Devices

Server

  • GPU: a2-highgpu-1g.
  • CPU: c2-standard-60.
  • (Retired) c2-standard-16.

Mobile

  • Pixel 6 Pro, Pixel 8 Pro.
  • Motorola Edge+ (2023), Motorola Edge x30.
  • (Retired) Pixel 4.

Generated Artifacts

Most workloads are sourced from HuggingFace Transformers and are available in PyTorch, JAX and Tensorflow. Artifacts are generated from each workload and used as input to benchmarks. This decouples the compiler/runtime from the framework and enables comparisons across a wider range of runtimes; for example, it is possible to run compiler-level comparisons between IREE, XLA and TFLite using artifacts derived from the same JAX workload.
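
For instance, a StableHLO MLIR artifact can be obtained directly from a JAX function via jit lowering; a minimal sketch (illustrative only, not the repo's actual export tooling):

import jax
import jax.numpy as jnp

# Lower a JAX function with jax.jit to obtain StableHLO/MHLO MLIR text that a
# compiler can consume independently of the framework.
def forward(x):
    return jnp.dot(x, x.T)

x = jnp.ones((4, 8), dtype=jnp.float32)
lowered = jax.jit(forward).lower(x)

with open("forward.stablehlo.mlir", "w") as f:
    f.write(lowered.as_text())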

Below is a list of artifacts that are generated from each framework:

JAX:

  • StableHLO MLIR.
  • XLA HLO Dump.
  • Tensorflow SavedModel (through JAX2TF).
  • TFLite Flatbuffer. Using post-training quantization, FP16, dynamic-range quantized and INT8 variants are also generated.

PyTorch:

  • Linalg MLIR (through torch-mlir).

Tensorflow:

  • StableHLO MLIR.
  • XLA HLO Dump.
  • Tensorflow SavedModel.
  • TFLite Flatbuffer. Using post-training quantization, FP16, dynamic-range quantized and INT8 variants are also generated.

TFLite:

  • TOSA MLIR.
  • TFLite flatbuffer.

Input/Output Data

Input and output data is also generated and saved as numpy arrays. This data can be used downstream to test accuracy.
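
A minimal sketch of how the saved arrays can be used (the file names are hypothetical, and my_runtime_invoke is a stand-in for whichever runtime is under test):

import numpy as np

# The saved .npy arrays act as golden data for checking a runtime's outputs
# within a tolerance.
def my_runtime_invoke(x: np.ndarray) -> np.ndarray:
    # Stand-in for the runtime under test (IREE, XLA, TFLite, ...).
    return x.astype(np.float32)

inputs = np.load("input_0.npy")              # generated model input
expected = np.load("expected_output_0.npy")  # reference output from the source framework

actual = my_runtime_invoke(inputs)
np.testing.assert_allclose(actual, expected, atol=0.5, rtol=0)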

Supported Workloads

Below is a list of workloads currently being benchmarked. To add more workloads, please read "User's Guide".

Single Model

| Framework | Model | Data Type | Batch Sizes | Input Size |
|---|---|---|---|---|
| JAX | T5-Large | FP32, FP16, BF16 | 1, 16, 24, 32, 48, 64, 512 | Sequence length 512 |
| JAX | T5-Large for Conditional-Generation | FP32 | 1, 16, 24, 32, 48 | Sequence length 512 |
| JAX | T5-Small | FP32 | 1 | Sequence length 128 |
| JAX | Bert-Large | FP32, FP16, BF16 | 1, 16, 24, 32, 48, 64, 512, 1024, 1280 | Sequence length 384 |
| JAX | Bert-Base | FP32, FP16, BF16 | 1 | Input sequences 8, 32, 64, 128, 256, 512 |
| JAX | ResNet50 | FP32, FP16, BF16 | 1, 8, 64, 128, 256, 2048 | Input image 3x224x224 |
| JAX | GPT-2 with LMHead | FP32 | 1 | Sequence length 512 |
| JAX | ViT | FP32 | 1 | Input image 3x224x224 |
| PyTorch | Bert-Large | FP32, FP16 | 1, 16, 24, 32, 48, 64, 512, 1024, 1280 | Sequence length 384 |
| PyTorch | ResNet50 | FP32, FP16 | 1, 8, 64, 128, 256, 2048 | Input image 3x224x224 |
| Tensorflow | T5-Large | FP32 | 1, 16, 24, 32, 48, 64, 512 | Input sequence 512 |
| Tensorflow | Bert-Large | FP32 | 1, 16, 24, 32, 48, 64, 512, 1024, 1280 | Input sequence 384 |
| Tensorflow | ResNet50 | FP32 | 1, 8, 64, 128, 256, 2048 | Input image 224x224x3 |
| Tensorflow | EfficientNet-B7 | FP32 | 1, 64, 128 | Input image 600x600x3 |
| TFLite | Bert-Base | FP32, FP16, Dynamic-range quant, INT8 | 1 | Input sequences 8, 32, 64, 128, 256, 512 |
| TFLite | ViT | FP32, FP16, Dynamic-range quant, INT8 | 1 | Input image 3x224x224 |

Pipeline

Pipelines may include more than one model or control flow.

| Framework | Pipeline | Data Type | Variations |
|---|---|---|---|
| JAX | T5-Small | FP32, FP16, BF16 | Token generation sizes: 16, 32, 64, 128, 256 |
| JAX | Stable Diffusion | FP32, FP16, BF16 | Input sequence 64 tokens |
| JAX | GPT-2 with LMHead | FP32 | Generates 200 tokens |
| Tensorflow | GPT-2 with LMHead | FP32 | Generates 200 tokens |
| GGML | GPT-2 with LMHead | FP32, FP16 | Generates 200 tokens |

Dashboards

User's Guide

To add new models and benchmarks, see Onboarding New Models and Benchmarks.

Contacts

  • GitHub issues: Feature requests, bugs, and other work tracking
  • OpenXLA discord: Daily development discussions with the core team and collaborators

License

OpenXLA Benchmark is licensed under the terms of the Apache 2.0 License with LLVM Exceptions. See LICENSE for more information.

iree-comparative-benchmark's People

Contributors

mariecwhite, beckerhe, dependabot[bot], gmngeoffrey, scotttodd


iree-comparative-benchmark's Issues

[XLA-HLO:GPU] Type errors on BERT_LARGE_FP16_JAX_* models

docker run --gpus all --mount="type=bind,src="${PWD}",target=/work" --workdir="/work" "gcr.io/iree-oss/openxla-benchmark/cuda11.8-cudnn8.9@sha256:c39107c4160e749b7c4bac18862c6c1b6d56e1aa60644a4fe323e315ffba0a0b" /work/xla-tools-dir/hlo_runner_main --hlo_file=/work/xla_hlo_before_optimizations.txt --device_type=gpu --num_repeats=50 --input_format=text --num_replicas=1 --num_partitions=1 --logtostderr
2023-08-04 19:15:21.721351: I xla/service/service.cc:168] XLA service 0x5640370dddd0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-08-04 19:15:21.721415: I xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA A100-SXM4-40GB, Compute Capability 8.0
2023-08-04 19:15:21.721767: I xla/pjrt/gpu/se_gpu_pjrt_client.cc:633] Using BFC allocator.
2023-08-04 19:15:21.721826: I xla/pjrt/gpu/gpu_helpers.cc:105] XLA backend allocating 31753961472 bytes on device 0 for BFCAllocator.
2023-08-04 19:15:31.158463: I xla/stream_executor/cuda/cuda_dnn.cc:442] Loaded cuDNN version 8900
2023-08-04 19:15:34.067278: I xla/stream_executor/gpu/asm_compiler.cc:328] ptxas warning : Registers are spilled to local memory in function 'triton_gemm_dot_295', 996 bytes spill stores, 1108 bytes spill loads

2023-08-04 19:15:36.668819: W xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: Unexpected GEMM dtype: f32 f32 f16
2023-08-04 19:15:36.699421: F xla/tools/multihost_hlo_runner/hlo_runner_main.cc:121] Non-OK-status: xla::FunctionalHloRunner::LoadAndRunAndDump( *client.value(), preproc_options, raw_compile_options, running_options, {hlo_file}, input_format, dump_output_literal_to, task_id) status: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.gemm' failed: Unexpected GEMM dtype: f32 f32 f16; current tracing scope: custom-call; current profiling annotation: XlaModule:#hlo_module=extracted,program_id=131#.

Reproduce:

wget -O xla_hlo_before_optimizations.txt https://storage.googleapis.com/iree-model-artifacts/jax/jax_models_0.4.13_1688607404/BERT_LARGE_FP16_JAX_384XI32_BATCH1/xla_hlo_before_optimizations.txt

docker run --gpus all --mount="type=bind,src="${PWD}",target=/work" --workdir="/work" "gcr.io/iree-oss/openxla-benchmark/cuda11.8-cudnn8.9@sha256:c39107c4160e749b7c4bac18862c6c1b6d56e1aa60644a4fe323e315ffba0a0b" /work/xla-tools-dir/hlo_runner_main --hlo_file=/work/xla_hlo_before_optimizations.txt --device_type=gpu --num_repeats=50 --input_format=text --num_replicas=1 --num_partitions=1 --logtostderr

Fetch dumped input in framework benchmarks

Currently, framework benchmarks call each model's generate_inputs to generate the model inputs.

For most models, the generated inputs are already dumped as npy files (for compiler benchmarks). Framework benchmarks should also load model inputs from those dumps when possible.
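
A minimal sketch of the proposed behaviour (dump_dir, the file-name pattern, and load_or_generate_inputs are assumptions, not the repo's exact API):

from pathlib import Path
import numpy as np

# Prefer the dumped .npy inputs when they exist; only fall back to generating
# them via the model's generate_inputs.
def load_or_generate_inputs(model, dump_dir: Path):
    dumps = sorted(dump_dir.glob("input_*.npy"))
    if dumps:
        return [np.load(p) for p in dumps]
    return model.generate_inputs()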

Automate artifact generation in CI

We provide the tools to generate benchmark artifacts, but they take effort to set up, and people need permission to upload artifacts to GCS.

We should provide a CI workflow to automate this process (and also control the environment).

### Tasks
- [ ] Add workflow to generate artifacts

Support multi-model framework benchmarks

Framework-level workloads such as Stable Diffusion and LLMs are composed of multiple models, and the benchmark suite previously had no such example.

Add a multi-model benchmark early on to make sure the design can handle it.
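
A self-contained toy sketch of the multi-model shape such pipelines take (not the suite's actual interface): one benchmark case drives several models plus Python control flow.

import numpy as np

class ToyEncoder:
    def forward(self, tokens: np.ndarray) -> np.ndarray:
        return tokens.astype(np.float32) * 0.5

class ToyDecoder:
    def forward(self, latents: np.ndarray, steps: int = 4) -> np.ndarray:
        for _ in range(steps):  # control flow between model calls
            latents = np.tanh(latents)
        return latents

class ToyPipeline:
    def __init__(self):
        self.encoder, self.decoder = ToyEncoder(), ToyDecoder()

    def forward(self, tokens: np.ndarray) -> np.ndarray:
        return self.decoder.forward(self.encoder.forward(tokens))

print(ToyPipeline().forward(np.ones((1, 64), dtype=np.int32)).shape)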

Refactor Model Interface to not use a tuple for its `forward` function

The TF SavedModel interface fails when we use a tuple as the parameter to the forward() function.

Currently we do:

@tf.function(jit_compile=True)
def forward(self, inputs: Tuple[Any, ...]) -> Tuple[Any, ...]:
  input_ids, attention_mask = inputs
  output = self.model(input_ids, attention_mask,
                      training=False).last_hidden_state
  return (output,)

But in order for it to work with SavedModel, we need to change it to:

@tf.function(jit_compile=True)
def forward_sm(self, input_ids, attention_mask):
  return self.model(input_ids, attention_mask,
                    training=False).last_hidden_state
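
A hedged, self-contained sketch of why the second form exports cleanly (a toy tf.Module standing in for the real model class): SavedModel signatures are built from named tensor arguments, which a single tuple parameter obscures.

import tensorflow as tf

# Toy stand-in for the suite's model class, illustrative only.
class ToyModel(tf.Module):

  @tf.function(jit_compile=True)
  def forward_sm(self, input_ids, attention_mask):
    # Stand-in for self.model(input_ids, attention_mask, ...).last_hidden_state.
    return tf.cast(input_ids + attention_mask, tf.float32)

m = ToyModel()
tf.saved_model.save(
    m, "/tmp/toy_sm",
    signatures=m.forward_sm.get_concrete_function(
        tf.TensorSpec([1, 384], tf.int32, name="input_ids"),
        tf.TensorSpec([1, 384], tf.int32, name="attention_mask")))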

[Ergonomic] Simplify test data definition process

To add new test data, we should only need to define the test data source (e.g. an image or text), and artifact generation should automatically generate variants for each model. The simplified definition could look like:

# We only need to have one definition for each input data source. And it only needs to tell where to get the source raw data.
APPLE_IMAGE = ModelTestData(
  name="APPLE_IMAGE",
  source_url="url to download the image"
)
ORANGE_IMAGE = ModelTestData(
  name="ORANGE_IMAGE",
  source_url="url to download the image"
)

# Generate the models with different batch sizes, as what we do today.
# (we can actually expand the template to support different data types)
MODEL_RESNET50_FP32_BATCHES = generate_batch_models(...)
MODEL_RESNET50_FP16_BATCHES = generate_batch_models(...)

# Generate the combinations of all models with the apple image input.
BENCHMARK_LIST = generate_benchmark_cases(
  models=MODEL_RESNET50_FP32_BATCHES + MODEL_RESNET50_FP16_BATCHES,
  input_data=APPLE_IMAGE,
  output_verification="the tolerance to verify the output"
)

Artifact generation process in the benchmark suite

Models usually require the raw inputs (text, images) to be preprocessed (e.g. by a tokenizer).

Currently we have ModelTestDataArtifact to store the preprocessed data for each model, but we should also include the raw data and the parameters needed to reproduce the preprocessed data from it.

The same process also needs to handle exported model generation.
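
A hedged sketch of what the expanded artifact definition could look like (the field names are illustrative, not the repo's API):

from dataclasses import dataclass
from typing import Dict

# Keep the raw source and the preprocessing parameters next to the preprocessed
# dump, so the dump can always be reproduced from the raw data.
@dataclass(frozen=True)
class ModelTestDataArtifact:
  name: str
  raw_data_url: str                  # original text/image source
  preprocess_params: Dict[str, str]  # e.g. tokenizer name, sequence length
  preprocessed_data_url: str         # the .npy dump derived from the above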

[Migration] Update dashboard to process results from openxla-benchmark

Benchmark results uploaded from openxla-benchmark are a little different from those uploaded from iree-samples.

Here is the example from T5_LARGE_FP32_JAX benchmarks:

From iree-samples: https://gist.github.com/pzread/2be3b2db7c0ffa14518085f08e33b815
From openxla-benchmarks: https://gist.github.com/pzread/cf3e089d0a4e0d5c452aebc1df6dbc44

The major differences are:

  • Changes in benchmark id and benchmark name
  • Changes in some metadata values (e.g. data_type and device field)
  • Changes in the format of python_environment field

I think the pain point will be the change of benchmark ids, which means we might need to backfill the historical data. We can still show new data in the dashboard first and backfill the previous data afterwards.

SD_PIPELINE_FP16_JAX benchmark executes in FP32

The benchmark attempts to convert SD_PIPELINE_FP16_JAX by calling `to_fp16` on the model parameters:
https://github.com/iree-org/iree-comparative-benchmark/blob/main/common_benchmark_suite/openxla/benchmark/models/jax/stable_diffusion/stable_diffusion_pipeline.py#L47-L51

The only thing this achieves is to convert model weights into float16. Model activations start as float32 https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py#L257 and elsewhere (e.g. when time embeddings are generated). Whenever a flax module is executed with float16 weights and float32 activations (or vice versa), unless it has an explicit compute type, it promotes everything to float32. https://github.com/google/flax/blob/main/flax/linen/linear.py#L189

One way to actually run it in FP16 is to pass a dtype in the call here: https://github.com/iree-org/iree-comparative-benchmark/blob/main/common_benchmark_suite/openxla/benchmark/models/jax/stable_diffusion/stable_diffusion_pipeline.py#L38
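
A minimal sketch of the promotion behaviour with a toy flax layer (not the SD pipeline itself): with fp16 weights but fp32 activations, the computation runs in fp32 unless the module is given an explicit compute dtype.

import jax
import jax.numpy as jnp
import flax.linen as nn

layer = nn.Dense(features=4)
params = layer.init(jax.random.PRNGKey(0), jnp.ones((1, 8)))
params_fp16 = jax.tree_util.tree_map(lambda p: p.astype(jnp.float16), params)

out = layer.apply(params_fp16, jnp.ones((1, 8), dtype=jnp.float32))
print(out.dtype)  # float32: inputs and params are promoted

out_fp16 = nn.Dense(features=4, dtype=jnp.float16).apply(
    params_fp16, jnp.ones((1, 8), dtype=jnp.float32))
print(out_fp16.dtype)  # float16: explicit compute dtype keeps half precision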

Finalize benchmark id format

Finalize the benchmark id format before we start importing data into the database; otherwise any change to benchmark ids means another backfill. Ideally this should be a small RFC.

[Goal] Document process to onboard models and benchmarks

### Documentation
- [ ] #84
### P1 Experience Improvements
- [ ] #85
- [ ] https://github.com/openxla/openxla-benchmark/pull/93
- [ ] #81
- [ ] https://github.com/openxla/openxla-benchmark/pull/91
- [ ] #87
- [ ] https://github.com/openxla/openxla-benchmark/pull/103
### P2 Experience Improvements
- [ ] #54
- [ ] #60
- [ ] https://github.com/openxla/openxla-benchmark/issues/104
- [ ] https://github.com/openxla/openxla-benchmark/issues/99
- [ ] https://github.com/openxla/openxla-benchmark/issues/98

Evaluate if we want to move to BuildKit for automatic Docker image handling

BuildKit makes it easy to manage Docker images in a CI scenario, as it can automatically rebuild images when an image's input files (Dockerfile and context) change, so no manual builds and pushes are needed.

One of many tutorials describing the workflow is here: https://testdriven.io/blog/faster-ci-builds-with-docker-cache/

PR #77 implements some parts of this in its first commit (we later decided not to adopt it for now).

[Goal] Show benchmark results in dashboard

Here are the tasks needed to show benchmark results from openxla-benchmark in the dashboard.

### Finalize benchmark IDs and names
- [ ] #47
### Update database and dashboard
- [ ] https://github.com/openxla/openxla-benchmark/issues/20
- [ ] https://github.com/openxla/openxla-benchmark/issues/39

Output 0 exceeds tolerance?

 ~/benchmark > git clone https://github.com/openxla/iree-comparative-benchmark
Cloning into 'iree-comparative-benchmark'...
remote: Enumerating objects: 2003, done.
remote: Counting objects: 100% (1051/1051), done.
remote: Compressing objects: 100% (440/440), done.
remote: Total 2003 (delta 770), reused 683 (delta 585), pack-reused 952
Receiving objects: 100% (2003/2003), 1.01 MiB | 3.58 MiB/s, done.
Resolving deltas: 100% (1148/1148), done.
 ~/benchmark > cd iree-comparative-benchmark/comparative_benchmark/jax
 ~/benchmark/iree-comparative-benchmark/comparative_benchmark/jax > ./setup_venv.sh
...
 ~/benchmark/iree-comparative-benchmark/comparative_benchmark/jax > source jax-benchmarks.venv/bin/activate
(jax-benchmarks.venv)  ~/benchmark/iree-comparative-benchmark/comparative_benchmark/jax > ./run_benchmarks.py  -o test -device host-cpu -name models/BERT_BASE_FP32_JAX_I32_SEQLEN32/inputs/INPUT_DATA_MODEL_DEFAULT


--- models/BERT_BASE_FP32_JAX_I32_SEQLEN32/inputs/INPUT_DATA_MODEL_DEFAULT ---
/usr/lib/python3.9/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
Some weights of FlaxBertModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: {('pooler', 'dense', 'bias'), ('pooler', 'dense', 'kernel')}
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(jax-benchmarks.venv) ~/benchmark/iree-comparative-benchmark/comparative_benchmark/jax > cat test
{"benchmarks": [{"definition": {"benchmark_name": "models/BERT_BASE_FP32_JAX_I32_SEQLEN32/inputs/INPUT_DATA_MODEL_DEFAULT", "framework": "ModelFrameworkType.JAX", "data_type": "fp32", "batch_size": 1, "compiler": "xla", "device": "host-cpu", "tags": ["transformer-encoder", "bert", "seqlen-32"]}, "metrics": {"framework_level": {"error": "['Output 0 exceeds tolerance. Max diff: 8.474491119384766, atol: 0.5, rtol: 0']"}}}]}

I tried both -device host-cpu and -device host-gpu, and over 20 different models. I got what looks like valid timings from models/SD_PIPELINE_FP16_JAX_64XI32_BATCH1/inputs/INPUT_DATA_MODEL_DEFAULT; all others return "Output 0 exceeds tolerance."

[Migration] Cleanup benchmark ID and name format

The current benchmark names and IDs in the comparative benchmark suite are temporary and not well organized (e.g. inconsistent letter casing and structure).

Before uploading results to the database, we should review all benchmark IDs and names and fix any obvious problems or improvements.

Incompatible version between JAX and transformers

It looks like the JAX benchmarks are failing because the latest JAX 0.4.14 is incompatible with transformers:

"error": "Failed to import transformers.models.gpt2.modeling_flax_gpt2 because of the following error (look up to see its traceback):\nmodule 'jax.numpy' has no attribute 'DeviceArray'"

The related issue on transformers mentions that it currently only supports up to JAX 0.4.13.

Support data types in model template

Several models have fp32, fp16, and bf16 variants. Currently we keep a duplicate definition for each data type. It would be easier to define them if the model template also covered data types in addition to batch sizes.
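
A hedged sketch of what a dtype-aware template expansion could look like (generate_model_variants and the name template are assumptions, not the suite's API):

import itertools

# Expand one template over data types and batch sizes instead of keeping a
# hand-written definition per dtype.
def generate_model_variants(name_template, data_types, batch_sizes):
  return [
      dict(name=name_template.format(dtype=d.upper(), batch=b),
           data_type=d, batch_size=b)
      for d, b in itertools.product(data_types, batch_sizes)
  ]

RESNET50_VARIANTS = generate_model_variants(
    "RESNET50_{dtype}_JAX_3X224X224_BATCH{batch}",
    data_types=["fp32", "fp16", "bf16"],
    batch_sizes=[1, 8, 64, 128, 256, 2048])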

[Goal] Run all iree-samples comparative benchmarks

Here are the tasks to run all iree-samples comparative benchmarks in the openxla-benchmark workflow.

### Run all XLA/HLO benchmarks
- [ ] #12
### Run all JAX benchmarks
- [ ] #34
### Run all TF/XLA benchmarks
- [ ] #35 
- [ ] #37
### Run all PyTorch Inductor benchmarks
- [ ] #36 
- [ ] #38 
### CI infrastructure
- [ ] #40

[Ergonomic] Local run of framework benchmark should be possible without uploading any artifacts

For framework benchmarks, it should be possible to run a model without uploading any artifacts.

This can be done by generating the artifacts locally first (with the ML framework) and then running the benchmarks.

Compiler benchmarks are a little trickier because they usually need the artifacts exported from the ML frameworks first, and we don't want to pull framework dependencies into compiler benchmarks, so they are not included in this task.

Support multiple raw inputs

When generating model artifacts, a model may accept different inputs. At the moment we hardcode one default input, but we should support the option of using different inputs.

[PT-Inductor:GPU] Models failed with `quantile() input tensor must be either float or double dtype`

https://github.com/openxla/openxla-benchmark/actions/runs/5483405564/jobs/9989714567#step:5:1253

--- models/BERT_LARGE_FP16_PT_384XI32_BATCH16/inputs/INPUT_DATA_BERT_LARGE_FP16_PT_384XI32_BATCH16/expected_outputs/OUTPUT_DATA_BERT_LARGE_FP16_PT_384X1024XF16_BATCH16/target_devices/a2-highgpu-1g ---
Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Failed to benchmark model BERT_LARGE_FP16_PT_384XI32_BATCH16.
{
  "definition": {
    "benchmark_id": "models/47cb0d3a-5eb7-41c7-9d7c-97aae7023ecf-MODEL_BERT_LARGE-fp16-PT-384xi32-batch16/inputs/2bbb87cf-a910-4262-a9d8-ceff295f1c24-fp16-batch16/expected_outputs/cd625ba7-fc70-4a87-92eb-5acab0c77beb-fp16-batch16/target_devices/78c56b95-2d7d-44b5-b5fd-8e47aa961108",
    "benchmark_name": "models/BERT_LARGE_FP16_PT_384XI32_BATCH16/inputs/INPUT_DATA_BERT_LARGE_FP16_PT_384XI32_BATCH16/expected_outputs/OUTPUT_DATA_BERT_LARGE_FP16_PT_384X1024XF16_BATCH16/target_devices/a2-highgpu-1g",
    "framework": "ModelFrameworkType.PYTORCH",
    "data_type": "fp16",
    "batch_size": 16,
    "inputs": [
      "16x384xi32",
      "16x384xi32"
    ],
    "outputs": [
      "16x384x1024xf16"
    ],
    "compiler": "xla",
    "device": "a2-highgpu-1g",
    "tags": [
      "transformer-encoder",
      "bert",
      "batch-16"
    ]
  },
  "metrics": {
    "framework_level": {
      "error": "quantile() input tensor must be either float or double dtype"
    }
  }
}
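
For reference, torch.quantile only accepts float32/float64 tensors, so a half-precision tensor reproduces exactly this error. A minimal sketch of the failure and a cast-to-float workaround (whether casting is the right fix in this harness is an assumption):

import torch

# torch.quantile rejects float16 input with the error seen in the benchmark log.
t = torch.randn(100, dtype=torch.float16)
try:
    torch.quantile(t, 0.5)
except RuntimeError as e:
    print(e)  # quantile() input tensor must be either float or double dtype

# Casting to float32 before the quantile call avoids the error.
print(torch.quantile(t.float(), 0.5))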
