
iree-comparative-benchmark's Introduction

OpenXLA Benchmark

This repository is the home of the common benchmarking infrastructure described in the accompanying RFC. It aims to be a compiler-agnostic benchmark suite that can be used both in standalone comparative benchmark workflows and in the regression benchmarking that lives in each compiler project.

There are two components in this repository: the common_benchmark_suite and the comparative_benchmark.

The common_benchmark_suite is standalone and should not depend on the comparative_benchmark.

Supported Runtimes

Framework Level

These benchmarks are run from the Deep Learning Framework and measure the end-to-end latency a user sees when running the workload from a framework such as PyTorch. Supported runtimes:

  • JAX with IREE PJRT.
  • JAX and Tensorflow with XLA.
  • PyTorch with Inductor.
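
As a rough illustration (a minimal sketch that times a jitted JAX call directly, not this suite's actual harness), a framework-level measurement captures the end-to-end latency of the call as the user makes it, after a warm-up run to exclude compilation:

import time
import jax
import jax.numpy as jnp

# Minimal framework-level timing sketch: measure the latency of a jitted call
# as seen from the framework, excluding the first (compile) invocation.
@jax.jit
def forward(x):
    return jnp.tanh(x @ x.T).sum()

x = jnp.ones((512, 512), dtype=jnp.float32)
forward(x).block_until_ready()  # warm-up / compilation

start = time.perf_counter()
for _ in range(10):
    forward(x).block_until_ready()
print(f"mean latency: {(time.perf_counter() - start) / 10 * 1e3:.2f} ms")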

Compiler/Library Level

These benchmarks do not include the Deep Learning Framework, making them more reflective of the final deployment environment or an AOT deployment. Supported runtimes:

  • JAX, Tensorflow, PyTorch and TFLite with IREE using MLIR input.
  • JAX, Tensorflow with XLA using HLO input.
  • TFLite.
  • GGML (experimental).

Supported Devices

Server

  • GPU: a2-highgpu-1g.
  • CPU: c2-standard-60.
  • (Retired) c2-standard-16.

Mobile

  • Pixel 6 Pro, Pixel 8 Pro.
  • Motorola Edge+ (2023), Motorola Edge x30.
  • (Retired) Pixel 4.

Generated Artifacts

Most workloads are sourced from HuggingFace Transformers and are available in PyTorch, JAX and Tensorflow. Artifacts are generated from each workload and used as input to benchmarks. This decouples the compiler/runtime from the framework and enables comparisons across a wider range of runtimes; for example, it is possible to run compiler-level comparisons between IREE, XLA and TFLite using artifacts derived from the same JAX workload.
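
For instance, a StableHLO MLIR artifact can be obtained directly from a JAX function via jit lowering; a minimal sketch (illustrative only, not the repo's actual export tooling):

import jax
import jax.numpy as jnp

# Lower a JAX function with jax.jit to obtain StableHLO/MHLO MLIR text that a
# compiler can consume independently of the framework.
def forward(x):
    return jnp.dot(x, x.T)

x = jnp.ones((4, 8), dtype=jnp.float32)
lowered = jax.jit(forward).lower(x)

with open("forward.stablehlo.mlir", "w") as f:
    f.write(lowered.as_text())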

Below is a list of artifacts that are generated from each framework:

JAX:

  • StableHLO MLIR.
  • XLA HLO Dump.
  • Tensorflow SavedModel (through JAX2TF).
  • TFLite Flatbuffer. Using post-training quantization, FP16, dynamic-range quantized and INT8 variants are also generated.

PyTorch:

  • Linalg MLIR (through torch-mlir).

Tensorflow:

  • StableHLO MLIR.
  • XLA HLO Dump.
  • Tensorflow SavedModel.
  • TFLite Flatbuffer. Using post-training quantization, FP16, dynamic-range quantized and INT8 variants are also generated.

TFLite:

  • TOSA MLIR.
  • TFLite flatbuffer.

Input/Output Data

Input and output data is also generated and saved as numpy arrays. This data can be used downstream to test accuracy.
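
A minimal sketch of how the saved arrays can be used (the file names are hypothetical, and my_runtime_invoke is a stand-in for whichever runtime is under test):

import numpy as np

# The saved .npy arrays act as golden data for checking a runtime's outputs
# within a tolerance.
def my_runtime_invoke(x: np.ndarray) -> np.ndarray:
    # Stand-in for the runtime under test (IREE, XLA, TFLite, ...).
    return x.astype(np.float32)

inputs = np.load("input_0.npy")              # generated model input
expected = np.load("expected_output_0.npy")  # reference output from the source framework

actual = my_runtime_invoke(inputs)
np.testing.assert_allclose(actual, expected, atol=0.5, rtol=0)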

Supported Workloads

Below is a list of workloads currently being benchmarked. To add more workloads, please read "User's Guide".

Single Model

| Framework | Model | Data Type | Batch Sizes | Input Size |
|---|---|---|---|---|
| JAX | T5-Large | FP32, FP16, BF16 | 1, 16, 24, 32, 48, 64, 512 | Sequence length 512 |
| JAX | T5-Large for Conditional-Generation | FP32 | 1, 16, 24, 32, 48 | Sequence length 512 |
| JAX | T5-Small | FP32 | 1 | Sequence length 128 |
| JAX | Bert-Large | FP32, FP16, BF16 | 1, 16, 24, 32, 48, 64, 512, 1024, 1280 | Sequence length 384 |
| JAX | Bert-Base | FP32, FP16, BF16 | 1 | Input sequences 8, 32, 64, 128, 256, 512 |
| JAX | ResNet50 | FP32, FP16, BF16 | 1, 8, 64, 128, 256, 2048 | Input image 3x224x224 |
| JAX | GPT-2 with LMHead | FP32 | 1 | Sequence length 512 |
| JAX | ViT | FP32 | 1 | Input image 3x224x224 |
| PyTorch | Bert-Large | FP32, FP16 | 1, 16, 24, 32, 48, 64, 512, 1024, 1280 | Sequence length 384 |
| PyTorch | ResNet50 | FP32, FP16 | 1, 8, 64, 128, 256, 2048 | Input image 3x224x224 |
| Tensorflow | T5-Large | FP32 | 1, 16, 24, 32, 48, 64, 512 | Input sequence 512 |
| Tensorflow | Bert-Large | FP32 | 1, 16, 24, 32, 48, 64, 512, 1024, 1280 | Input sequence 384 |
| Tensorflow | ResNet50 | FP32 | 1, 8, 64, 128, 256, 2048 | Input image 224x224x3 |
| Tensorflow | EfficientNet-B7 | FP32 | 1, 64, 128 | Input image 600x600x3 |
| TFLite | Bert-Base | FP32, FP16, Dynamic-range quant, INT8 | 1 | Input sequences 8, 32, 64, 128, 256, 512 |
| TFLite | ViT | FP32, FP16, Dynamic-range quant, INT8 | 1 | Input image 3x224x224 |

Pipeline

Pipelines may include more than one model or control flow.

| Framework | Pipeline | Data Type | Variations |
|---|---|---|---|
| JAX | T5-Small | FP32, FP16, BF16 | Token generation sizes: 16, 32, 64, 128, 256 |
| JAX | Stable Diffusion | FP32, FP16, BF16 | Input sequence 64 tokens |
| JAX | GPT-2 with LMHead | FP32 | Generates 200 tokens |
| Tensorflow | GPT-2 with LMHead | FP32 | Generates 200 tokens |
| GGML | GPT-2 with LMHead | FP32, FP16 | Generates 200 tokens |

Dashboards

User's Guide

To add new models and benchmarks, see Onboarding New Models and Benchmarks.

Contacts

  • GitHub issues: Feature requests, bugs, and other work tracking
  • OpenXLA discord: Daily development discussions with the core team and collaborators

License

OpenXLA Benchmark is licensed under the terms of the Apache 2.0 License with LLVM Exceptions. See LICENSE for more information.

iree-comparative-benchmark's People

Contributors

mariecwhite, beckerhe, dependabot[bot], gmngeoffrey, scotttodd


iree-comparative-benchmark's Issues

[XLA-HLO:GPU] Type errors on BERT_LARGE_FP16_JAX_* models

docker run --gpus all --mount="type=bind,src="${PWD}",target=/work" --workdir="/work" "gcr.io/iree-oss/openxla-benchmark/cuda11.8-cudnn8.9@sha256:c39107c4160e749b7c4bac18862c6c1b6d56e1aa60644a4fe323e315ffba0a0b" /work/xla-tools-dir/hlo_runner_main --hlo_file=/work/xla_hlo_before_optimizations.txt --device_type=gpu --num_repeats=50 --input_format=text --num_replicas=1 --num_partitions=1 --logtostderr
2023-08-04 19:15:21.721351: I xla/service/service.cc:168] XLA service 0x5640370dddd0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-08-04 19:15:21.721415: I xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA A100-SXM4-40GB, Compute Capability 8.0
2023-08-04 19:15:21.721767: I xla/pjrt/gpu/se_gpu_pjrt_client.cc:633] Using BFC allocator.
2023-08-04 19:15:21.721826: I xla/pjrt/gpu/gpu_helpers.cc:105] XLA backend allocating 31753961472 bytes on device 0 for BFCAllocator.
2023-08-04 19:15:31.158463: I xla/stream_executor/cuda/cuda_dnn.cc:442] Loaded cuDNN version 8900
2023-08-04 19:15:34.067278: I xla/stream_executor/gpu/asm_compiler.cc:328] ptxas warning : Registers are spilled to local memory in function 'triton_gemm_dot_295', 996 bytes spill stores, 1108 bytes spill loads

2023-08-04 19:15:36.668819: W xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: Unexpected GEMM dtype: f32 f32 f16
2023-08-04 19:15:36.699421: F xla/tools/multihost_hlo_runner/hlo_runner_main.cc:121] Non-OK-status: xla::FunctionalHloRunner::LoadAndRunAndDump( *client.value(), preproc_options, raw_compile_options, running_options, {hlo_file}, input_format, dump_output_literal_to, task_id) status: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.gemm' failed: Unexpected GEMM dtype: f32 f32 f16; current tracing scope: custom-call; current profiling annotation: XlaModule:#hlo_module=extracted,program_id=131#.

Reproduce:

wget -O xla_hlo_before_optimizations.txt https://storage.googleapis.com/iree-model-artifacts/jax/jax_models_0.4.13_1688607404/BERT_LARGE_FP16_JAX_384XI32_BATCH1/xla_hlo_before_optimizations.txt

docker run --gpus all --mount="type=bind,src="${PWD}",target=/work" --workdir="/work" "gcr.io/iree-oss/openxla-benchmark/cuda11.8-cudnn8.9@sha256:c39107c4160e749b7c4bac18862c6c1b6d56e1aa60644a4fe323e315ffba0a0b" /work/xla-tools-dir/hlo_runner_main --hlo_file=/work/xla_hlo_before_optimizations.txt --device_type=gpu --num_repeats=50 --input_format=text --num_replicas=1 --num_partitions=1 --logtostderr

Fetch dumped input in framework benchmarks

Currently, framework benchmarks call each model's generate_inputs to generate the model inputs.

For most models, the generated inputs are already dumped as npy files (for compiler benchmarks). Framework benchmarks should also load model inputs from those dumps when possible.
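
A minimal sketch of the proposed behaviour (dump_dir, the file-name pattern, and load_or_generate_inputs are assumptions, not the repo's exact API):

from pathlib import Path
import numpy as np

# Prefer the dumped .npy inputs when they exist; only fall back to generating
# them via the model's generate_inputs.
def load_or_generate_inputs(model, dump_dir: Path):
    dumps = sorted(dump_dir.glob("input_*.npy"))
    if dumps:
        return [np.load(p) for p in dumps]
    return model.generate_inputs()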

Automate artifact generation in CI

We provide the tools to generate benchmark artifacts, but they take effort to set up, and people need permission to upload artifacts to GCS.

We should provide a CI workflow to automate this process (and also control the environment).

### Tasks
- [ ] Add workflow to generate artifacts

Support multi-model framework benchmarks

Framework-level workloads such as Stable Diffusion and LLMs are composed of multiple models, and the benchmark suite previously had no such example.

Add a multi-model benchmark early on to make sure the design can handle it.
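
A self-contained toy sketch of the multi-model shape such pipelines take (not the suite's actual interface): one benchmark case drives several models plus Python control flow.

import numpy as np

class ToyEncoder:
    def forward(self, tokens: np.ndarray) -> np.ndarray:
        return tokens.astype(np.float32) * 0.5

class ToyDecoder:
    def forward(self, latents: np.ndarray, steps: int = 4) -> np.ndarray:
        for _ in range(steps):  # control flow between model calls
            latents = np.tanh(latents)
        return latents

class ToyPipeline:
    def __init__(self):
        self.encoder, self.decoder = ToyEncoder(), ToyDecoder()

    def forward(self, tokens: np.ndarray) -> np.ndarray:
        return self.decoder.forward(self.encoder.forward(tokens))

print(ToyPipeline().forward(np.ones((1, 64), dtype=np.int32)).shape)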

Refactor Model Interface to not use a tuple for its `forward` function

The TF SavedModel interface fails when we use a tuple as the parameter to the forward() function.

Currently we do:

@tf.function(jit_compile=True)
def forward(self, inputs: Tuple[Any, ...]) -> Tuple[Any, ...]:
  input_ids, attention_mask = inputs
  output = self.model(input_ids, attention_mask,
                      training=False).last_hidden_state
  return (output,)

But in order for it to work with SavedModel, we need to change it to:

@tf.function(jit_compile=True)
def forward_sm(self, input_ids, attention_mask):
  return self.model(input_ids, attention_mask,
                    training=False).last_hidden_state
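
A hedged, self-contained sketch of why the second form exports cleanly (a toy tf.Module standing in for the real model class): SavedModel signatures are built from named tensor arguments, which a single tuple parameter obscures.

import tensorflow as tf

# Toy stand-in for the suite's model class, illustrative only.
class ToyModel(tf.Module):

  @tf.function(jit_compile=True)
  def forward_sm(self, input_ids, attention_mask):
    # Stand-in for self.model(input_ids, attention_mask, ...).last_hidden_state.
    return tf.cast(input_ids + attention_mask, tf.float32)

m = ToyModel()
tf.saved_model.save(
    m, "/tmp/toy_sm",
    signatures=m.forward_sm.get_concrete_function(
        tf.TensorSpec([1, 384], tf.int32, name="input_ids"),
        tf.TensorSpec([1, 384], tf.int32, name="attention_mask")))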

[Ergonomic] Simplify test data definition process

To add new test data, we should only need to define the test data source (e.g. an image or text), and artifact generation should automatically generate variants for each model. The simplified definition could look like:

# We only need to have one definition for each input data source. And it only needs to tell where to get the source raw data.
APPLE_IMAGE = ModelTestData(
  name="APPLE_IMAGE",
  source_url="url to download the image"
)
ORANGE_IMAGE = ModelTestData(
  name="ORANGE_IMAGE",
  source_url="url to download the image"
)

# Generate the models with different batch sizes, as what we do today.
# (we can actually expand the template to support different data types)
MODEL_RESNET50_FP32_BATCHES = generate_batch_models(...)
MODEL_RESNET50_FP16_BATCHES = generate_batch_models(...)

# Generate the combinations of all models with the apple image input.
BENCHMARK_LIST = generate_benchmark_cases(
  models=MODEL_RESNET50_FP32_BATCHES + MODEL_RESNET50_FP16_BATCHES,
  input_data=APPLE_IMAGE,
  output_verification="the tolerance to verify the output"
)

Artifact generation process in the benchmark suite

Models usually require the raw inputs (text, images) to be preprocessed (e.g. by a tokenizer).

Currently we have ModelTestDataArtifact to store the preprocessed data for each model, but we should also include the raw data and the parameters needed to reproduce the preprocessed data from it.

The same process also needs to handle exported model generation.
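
A hedged sketch of what the expanded artifact definition could look like (the field names are illustrative, not the repo's API):

from dataclasses import dataclass
from typing import Dict

# Keep the raw source and the preprocessing parameters next to the preprocessed
# dump, so the dump can always be reproduced from the raw data.
@dataclass(frozen=True)
class ModelTestDataArtifact:
  name: str
  raw_data_url: str                  # original text/image source
  preprocess_params: Dict[str, str]  # e.g. tokenizer name, sequence length
  preprocessed_data_url: str         # the .npy dump derived from the above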

[Migration] Update dashboard to process results from openxla-benchmark

Benchmark results uploaded from openxla-benchmark are a little different from those uploaded from iree-samples.

Here is the example from T5_LARGE_FP32_JAX benchmarks:

From iree-samples: https://gist.github.com/pzread/2be3b2db7c0ffa14518085f08e33b815
From openxla-benchmarks: https://gist.github.com/pzread/cf3e089d0a4e0d5c452aebc1df6dbc44

The major differences are:

  • Changes in benchmark id and benchmark name
  • Changes in some metadata values (e.g. data_type and device field)
  • Changes in the format of python_environment field

I think the pain point will be the change of benchmark ids, which means we might need to backfill the historical data. We can still show new data in the dashboard first and backfill the previous data afterwards.

SD_PIPELINE_FP16_JAX benchmark executes in FP32

The benchmark attempts to convert SD_PIPELINE_FP16_JAX by calling `to_fp16` on the model parameters:
https://github.com/iree-org/iree-comparative-benchmark/blob/main/common_benchmark_suite/openxla/benchmark/models/jax/stable_diffusion/stable_diffusion_pipeline.py#L47-L51

The only thing this achieves is to convert model weights into float16. Model activations start as float32 https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py#L257 and elsewhere (e.g. when time embeddings are generated). Whenever a flax module is executed with float16 weights and float32 activations (or vice versa), unless it has an explicit compute type, it promotes everything to float32. https://github.com/google/flax/blob/main/flax/linen/linear.py#L189

One way to actually run it in FP16 is to pass a dtype in the call here: https://github.com/iree-org/iree-comparative-benchmark/blob/main/common_benchmark_suite/openxla/benchmark/models/jax/stable_diffusion/stable_diffusion_pipeline.py#L38
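
A minimal sketch of the promotion behaviour with a toy flax layer (not the SD pipeline itself): with fp16 weights but fp32 activations, the computation runs in fp32 unless the module is given an explicit compute dtype.

import jax
import jax.numpy as jnp
import flax.linen as nn

layer = nn.Dense(features=4)
params = layer.init(jax.random.PRNGKey(0), jnp.ones((1, 8)))
params_fp16 = jax.tree_util.tree_map(lambda p: p.astype(jnp.float16), params)

out = layer.apply(params_fp16, jnp.ones((1, 8), dtype=jnp.float32))
print(out.dtype)  # float32: inputs and params are promoted

out_fp16 = nn.Dense(features=4, dtype=jnp.float16).apply(
    params_fp16, jnp.ones((1, 8), dtype=jnp.float32))
print(out_fp16.dtype)  # float16: explicit compute dtype keeps half precision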

Finalize benchmark id format

Finalize the benchmark id format before we start importing data into the database; otherwise any change to benchmark ids means another backfill. Ideally this should be a small RFC.

[Goal] Document process to onboard models and benchmarks

### Documentation
- [ ] #84
### P1 Experience Improvements
- [ ] #85
- [ ] https://github.com/openxla/openxla-benchmark/pull/93
- [ ] #81
- [ ] https://github.com/openxla/openxla-benchmark/pull/91
- [ ] #87
- [ ] https://github.com/openxla/openxla-benchmark/pull/103
### P2 Experience Improvements
- [ ] #54
- [ ] #60
- [ ] https://github.com/openxla/openxla-benchmark/issues/104
- [ ] https://github.com/openxla/openxla-benchmark/issues/99
- [ ] https://github.com/openxla/openxla-benchmark/issues/98

Evaluate if we want to move to BuildKit for automatic Docker image handling

BuildKit makes it easy to manage Docker images in a CI scenario, as it can automatically rebuild images when an image's input files (Dockerfile and context) change, so no manual builds and pushes are needed.

One of many tutorials describing the workflow is here: https://testdriven.io/blog/faster-ci-builds-with-docker-cache/

PR #77 implements some parts of this in its first commit (we later decided not to adopt it for now).

[Goal] Show benchmark results in dashboard

Here are the tasks needed to show benchmark results from openxla-benchmark in the dashboard.

### Finalize benchmark IDs and names
- [ ] #47
### Update database and dashboard
- [ ] https://github.com/openxla/openxla-benchmark/issues/20
- [ ] https://github.com/openxla/openxla-benchmark/issues/39

Output 0 exceeds tolerance?

 ~/benchmark > git clone https://github.com/openxla/iree-comparative-benchmark
Cloning into 'iree-comparative-benchmark'...
remote: Enumerating objects: 2003, done.
remote: Counting objects: 100% (1051/1051), done.
remote: Compressing objects: 100% (440/440), done.
remote: Total 2003 (delta 770), reused 683 (delta 585), pack-reused 952
Receiving objects: 100% (2003/2003), 1.01 MiB | 3.58 MiB/s, done.
Resolving deltas: 100% (1148/1148), done.
 ~/benchmark > cd iree-comparative-benchmark/comparative_benchmark/jax
 ~/benchmark/iree-comparative-benchmark/comparative_benchmark/jax > ./setup_venv.sh
...
 ~/benchmark/iree-comparative-benchmark/comparative_benchmark/jax > source jax-benchmarks.venv/bin/activate
(jax-benchmarks.venv)  ~/benchmark/iree-comparative-benchmark/comparative_benchmark/jax > ./run_benchmarks.py  -o test -device host-cpu -name models/BERT_BASE_FP32_JAX_I32_SEQLEN32/inputs/INPUT_DATA_MODEL_DEFAULT


--- models/BERT_BASE_FP32_JAX_I32_SEQLEN32/inputs/INPUT_DATA_MODEL_DEFAULT ---
/usr/lib/python3.9/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
Some weights of FlaxBertModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: {('pooler', 'dense', 'bias'), ('pooler', 'dense', 'kernel')}
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(jax-benchmarks.venv) ~/benchmark/iree-comparative-benchmark/comparative_benchmark/jax > cat test
{"benchmarks": [{"definition": {"benchmark_name": "models/BERT_BASE_FP32_JAX_I32_SEQLEN32/inputs/INPUT_DATA_MODEL_DEFAULT", "framework": "ModelFrameworkType.JAX", "data_type": "fp32", "batch_size": 1, "compiler": "xla", "device": "host-cpu", "tags": ["transformer-encoder", "bert", "seqlen-32"]}, "metrics": {"framework_level": {"error": "['Output 0 exceeds tolerance. Max diff: 8.474491119384766, atol: 0.5, rtol: 0']"}}}]}

I tried both -device host-cpu and -device host-gpu, and over 20 different models. I got what looks like valid timings from models/SD_PIPELINE_FP16_JAX_64XI32_BATCH1/inputs/INPUT_DATA_MODEL_DEFAULT; all others return "Output 0 exceeds tolerance."

[Migration] Cleanup benchmark ID and name format

The current benchmark names and IDs in the comparative benchmark suite are temporary and not well organized (e.g. inconsistent letter casing and structure).

Before uploading results to the database, we should review all benchmark IDs and names and fix any obvious problems or improvements.

Incompatible version between JAX and transformers

It looks like the JAX benchmarks are failing because the latest JAX 0.4.14 is incompatible with transformers:

"error": "Failed to import transformers.models.gpt2.modeling_flax_gpt2 because of the following error (look up to see its traceback):\nmodule 'jax.numpy' has no attribute 'DeviceArray'"

The related issue on transformers mentions that it currently only supports up to JAX 0.4.13.

Support data types in model template

Several models have fp32, fp16, and bf16 variants. Currently we keep a duplicate definition for each data type. It would be easier to define them if the model template also covered data types in addition to batch sizes.
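
A hedged sketch of what a dtype-aware template expansion could look like (generate_model_variants and the name template are assumptions, not the suite's API):

import itertools

# Expand one template over data types and batch sizes instead of keeping a
# hand-written definition per dtype.
def generate_model_variants(name_template, data_types, batch_sizes):
  return [
      dict(name=name_template.format(dtype=d.upper(), batch=b),
           data_type=d, batch_size=b)
      for d, b in itertools.product(data_types, batch_sizes)
  ]

RESNET50_VARIANTS = generate_model_variants(
    "RESNET50_{dtype}_JAX_3X224X224_BATCH{batch}",
    data_types=["fp32", "fp16", "bf16"],
    batch_sizes=[1, 8, 64, 128, 256, 2048])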

[Goal] Run all iree-samples comparative benchmarks

Here are the tasks to run all iree-samples comparative benchmarks in the openxla-benchmark workflow.

### Run all XLA/HLO benchmarks
- [ ] #12
### Run all JAX benchmarks
- [ ] #34
### Run all TF/XLA benchmarks
- [ ] #35 
- [ ] #37
### Run all PyTorch Inductor benchmarks
- [ ] #36 
- [ ] #38 
### CI infrastructure
- [ ] #40

[Ergonomic] Local run of framework benchmark should be possible without uploading any artifacts

For framework benchmarks, it should be possible to run a model without uploading any artifacts.

This can be done by generating the artifacts locally first (with the ML framework) and then running the benchmarks.

Compiler benchmarks are a little trickier because they usually need the artifacts exported from the ML frameworks first, and we don't want to pull framework dependencies into compiler benchmarks, so they are not included in this task.

Support multiple raw inputs

When generating model artifacts, a model may accept different inputs. At the moment we hardcode one default input, but we should support the option of using different inputs.

[PT-Inductor:GPU] Models failed with `quantile() input tensor must be either float or double dtype`

https://github.com/openxla/openxla-benchmark/actions/runs/5483405564/jobs/9989714567#step:5:1253

--- models/BERT_LARGE_FP16_PT_384XI32_BATCH16/inputs/INPUT_DATA_BERT_LARGE_FP16_PT_384XI32_BATCH16/expected_outputs/OUTPUT_DATA_BERT_LARGE_FP16_PT_384X1024XF16_BATCH16/target_devices/a2-highgpu-1g ---
Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Failed to benchmark model BERT_LARGE_FP16_PT_384XI32_BATCH16.
{
  "definition": {
    "benchmark_id": "models/47cb0d3a-5eb7-41c7-9d7c-97aae7023ecf-MODEL_BERT_LARGE-fp16-PT-384xi32-batch16/inputs/2bbb87cf-a910-4262-a9d8-ceff295f1c24-fp16-batch16/expected_outputs/cd625ba7-fc70-4a87-92eb-5acab0c77beb-fp16-batch16/target_devices/78c56b95-2d7d-44b5-b5fd-8e47aa961108",
    "benchmark_name": "models/BERT_LARGE_FP16_PT_384XI32_BATCH16/inputs/INPUT_DATA_BERT_LARGE_FP16_PT_384XI32_BATCH16/expected_outputs/OUTPUT_DATA_BERT_LARGE_FP16_PT_384X1024XF16_BATCH16/target_devices/a2-highgpu-1g",
    "framework": "ModelFrameworkType.PYTORCH",
    "data_type": "fp16",
    "batch_size": 16,
    "inputs": [
      "16x384xi32",
      "16x384xi32"
    ],
    "outputs": [
      "16x384x1024xf16"
    ],
    "compiler": "xla",
    "device": "a2-highgpu-1g",
    "tags": [
      "transformer-encoder",
      "bert",
      "batch-16"
    ]
  },
  "metrics": {
    "framework_level": {
      "error": "quantile() input tensor must be either float or double dtype"
    }
  }
}
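
For reference, torch.quantile only accepts float32/float64 tensors, so a half-precision tensor reproduces exactly this error. A minimal sketch of the failure and a cast-to-float workaround (whether casting is the right fix in this harness is an assumption):

import torch

# torch.quantile rejects float16 input with the error seen in the benchmark log.
t = torch.randn(100, dtype=torch.float16)
try:
    torch.quantile(t, 0.5)
except RuntimeError as e:
    print(e)  # quantile() input tensor must be either float or double dtype

# Casting to float32 before the quantile call avoids the error.
print(torch.quantile(t.float(), 0.5))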
