Comments (7)
The specific use case is the following:
- Download a model and dataset from the Hub
- Optimise the model with optimum
- Run evaluation of the base vs optimised model on the dataset (ideally with latencies / throughput reported)
Since we're using the pipeline() function under the hood, I think it would be fine to just support CPU / GPU via the device argument. This would give a baseline for users to start from, and they can always roll their own hardware-specific loop if needed.
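To make the workflow concrete, here is a rough sketch of what it could look like for a text-classification model. The model and dataset names, the export flag on the optimum model, and the label mapping are illustrative assumptions, and the exact optimum API differs between versions:

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification
from evaluate import evaluator

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative model
data = load_dataset("imdb", split="test[:1000]")              # illustrative dataset
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Base pipeline
base_model = AutoModelForSequenceClassification.from_pretrained(model_id)
base_pipe = pipeline("text-classification", model=base_model, tokenizer=tokenizer)

# Optimised pipeline (exported to ONNX Runtime via optimum; older optimum
# versions use from_transformers=True instead of export=True)
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
ort_pipe = pipeline("text-classification", model=ort_model, tokenizer=tokenizer)

# Run the same evaluation on both pipelines; latency / throughput would
# ideally be added to the returned dict by the evaluator itself.
task_evaluator = evaluator("text-classification")
for name, pipe in [("base", base_pipe), ("optimised", ort_pipe)]:
    results = task_evaluator.compute(
        model_or_pipeline=pipe,
        data=data,
        metric="accuracy",
        label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
    )
    print(name, results)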
For reference, this is currently the function I'm using to compute latencies:
import numpy as np
from time import perf_counter

def time_pipeline(pipeline, dataset, num_samples=100):
    sample_ds = dataset.shuffle(seed=42).select(range(num_samples))
    latencies = []
    # Timed run
    for sample in sample_ds:
        start_time = perf_counter()
        _ = pipeline(sample["text"])
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    print(f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}")
    return {"time_avg_ms": time_avg_ms, "time_std_ms": time_std_ms}
from evaluate.
For latency and throughput, we should rather use a dummy input for different sequence lengths instead of selecting a few samples from the dataset.
Latency and throughput are normally reported for a fixed sequence length, e.g. 128.
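As a minimal sketch of what that could look like; the way the dummy text is constructed from the tokenizer is an assumption and only yields approximately the target number of tokens:

import numpy as np
from time import perf_counter

def time_dummy_input(pipe, tokenizer, seq_len=128, num_runs=100, warmup=10):
    # Build a synthetic input of roughly `seq_len` tokens by decoding a
    # repeated token id (approximate: special tokens are added on top)
    dummy_text = tokenizer.decode([tokenizer.unk_token_id] * seq_len)
    # Warm-up so one-off costs (caching, lazy initialisation) don't skew results
    for _ in range(warmup):
        _ = pipe(dummy_text)
    latencies = []
    for _ in range(num_runs):
        start = perf_counter()
        _ = pipe(dummy_text)
        latencies.append(perf_counter() - start)
    return {
        "seq_len": seq_len,
        "time_avg_ms": 1000 * np.mean(latencies),
        "samples_per_second": num_runs / np.sum(latencies),
    }

# e.g. sweep a few sequence lengths for a given pipeline / tokenizer:
# results = [time_dummy_input(pipe, tokenizer, seq_len=n) for n in (32, 128, 512)]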
from evaluate.
I think that's a great idea! True - it is backend dependent, but it will be very useful for debugging.
I wonder if it would be useful to optionally output not only the metric values but also some sort of evaluation report - basic setup information along with the runtime metrics, e.g. which device it was evaluated on, etc.
I think it would be valuable to have these numbers for the full evaluation, not only a dummy input, as it doesn't really cost anything and can provide additional insights (again, mainly in the debugging scenario).
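Purely as an illustration of the idea - all field names and metric values below are made up, and the device lookup assumes a PyTorch backend:

import platform
import torch  # assumption: PyTorch backend; other backends would report differently

evaluation_report = {
    # setup information
    "python_version": platform.python_version(),
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "gpu_name": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    # runtime metrics, e.g. from the latency helper above
    "accuracy": 0.91,            # placeholder
    "time_avg_ms": 12.3,         # placeholder
    "samples_per_second": 81.0,  # placeholder
}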
from evaluate.
I think we can just add the throughput information to the dict that is returned by the evaluator.
I like the idea of an evaluation report, however, I don't think we can assume to know e.g. the device the pipeline is running on: for now it is a transformers pipeline, but it could be any callable, so we would not know how to get that info. The evaluate.save function lets you store any information and by default also saves some system information. Maybe we could extend this and then let the user add whatever cannot be easily inferred (e.g. the device of the pipeline). What do you think?
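As a rough sketch of how that could look - the metric values are placeholders, and extra fields such as the device are supplied by the user since they cannot be inferred from an arbitrary callable:

import evaluate

results = {"accuracy": 0.91, "time_avg_ms": 12.3}  # placeholder values
# Writes a timestamped JSON file containing the passed fields plus some
# system information collected by evaluate.save
evaluate.save("./results/", **results, device="cuda:0", optimised=True)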
As for dummy inputs: I think this is something we should let the user handle. Maybe we can extend the docs with a dedicated "Evaluator" section and a "How to measure the performance of your pipeline" guide where we show best practices.
from evaluate.
Also @douwekiela proposed this in #23. The difficulty is that these numbers are hardware- and inference-setup-dependent, so the question is a bit what their value would be. Also, how would you calculate them with bootstrapping? Do you have a specific use case?
cc @ola13
from evaluate.
I like that use case! Even if the numbers are not directly comparable across systems, I do think people would appreciate the convenience (e.g. if I want to benchmark two models on the same system).
from evaluate.
I'm happy to look into this within the next 2 weeks; if someone feels like taking a stab at it, feel free to reassign :)
from evaluate.