Comments (7)
The specific use case is the following:
- Download a model and dataset from the Hub
- Optimise the model with optimum
- Run evaluation of the base vs optimised model on the dataset (ideally with latencies / throughput reported)
Since we're using the pipeline() function under the hood, I think it would be fine to just support CPU / GPU via the device argument. This would give a baseline for users to start from, and they can always roll their own hardware-specific loop if needed.
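To make the workflow concrete, here is a rough sketch of what it could look like for a text-classification model. The model and dataset names, the export flag on the optimum model, and the label mapping are illustrative assumptions, and the exact optimum API differs between versions:

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification
from evaluate import evaluator

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative model
data = load_dataset("imdb", split="test[:1000]")              # illustrative dataset
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Base pipeline
base_model = AutoModelForSequenceClassification.from_pretrained(model_id)
base_pipe = pipeline("text-classification", model=base_model, tokenizer=tokenizer)

# Optimised pipeline (exported to ONNX Runtime via optimum; older optimum
# versions use from_transformers=True instead of export=True)
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
ort_pipe = pipeline("text-classification", model=ort_model, tokenizer=tokenizer)

# Run the same evaluation on both pipelines; latency / throughput would
# ideally be added to the returned dict by the evaluator itself.
task_evaluator = evaluator("text-classification")
for name, pipe in [("base", base_pipe), ("optimised", ort_pipe)]:
    results = task_evaluator.compute(
        model_or_pipeline=pipe,
        data=data,
        metric="accuracy",
        label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
    )
    print(name, results)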
For reference, this is currently the function I'm using to compute latencies:
import numpy as np
from time import perf_counter

def time_pipeline(pipeline, dataset, num_samples=100):
    sample_ds = dataset.shuffle(seed=42).select(range(num_samples))
    latencies = []
    # Timed run
    for sample in sample_ds:
        start_time = perf_counter()
        _ = pipeline(sample["text"])
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    print(f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}")
    return {"time_avg_ms": time_avg_ms, "time_std_ms": time_std_ms}
from evaluate.
For latency and throughput, we should rather use a dummy input for different sequence lengths instead of selecting a few samples from the dataset.
Latency and throughput are normally reported for a fixed sequence length, e.g. 128.
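As a minimal sketch of what that could look like; the way the dummy text is constructed from the tokenizer is an assumption and only yields approximately the target number of tokens:

import numpy as np
from time import perf_counter

def time_dummy_input(pipe, tokenizer, seq_len=128, num_runs=100, warmup=10):
    # Build a synthetic input of roughly `seq_len` tokens by decoding a
    # repeated token id (approximate: special tokens are added on top)
    dummy_text = tokenizer.decode([tokenizer.unk_token_id] * seq_len)
    # Warm-up so one-off costs (caching, lazy initialisation) don't skew results
    for _ in range(warmup):
        _ = pipe(dummy_text)
    latencies = []
    for _ in range(num_runs):
        start = perf_counter()
        _ = pipe(dummy_text)
        latencies.append(perf_counter() - start)
    return {
        "seq_len": seq_len,
        "time_avg_ms": 1000 * np.mean(latencies),
        "samples_per_second": num_runs / np.sum(latencies),
    }

# e.g. sweep a few sequence lengths for a given pipeline / tokenizer:
# results = [time_dummy_input(pipe, tokenizer, seq_len=n) for n in (32, 128, 512)]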
from evaluate.
I think that's a great idea! True - it is backend dependent, but it will be very useful for debugging.
I wonder if it would be useful to optionally output not only the metric values but also some sort of evaluation report - basic setup information along with the runtime metrics, e.g. which device it was evaluated on, etc.
I think it would be valuable to have these numbers for the full evaluation, not only a dummy input, as it doesn't really cost anything and can provide additional insights (again, mainly in the debugging scenario).
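Purely as an illustration of the idea - all field names and metric values below are made up, and the device lookup assumes a PyTorch backend:

import platform
import torch  # assumption: PyTorch backend; other backends would report differently

evaluation_report = {
    # setup information
    "python_version": platform.python_version(),
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "gpu_name": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    # runtime metrics, e.g. from the latency helper above
    "accuracy": 0.91,            # placeholder
    "time_avg_ms": 12.3,         # placeholder
    "samples_per_second": 81.0,  # placeholder
}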
from evaluate.
I think we can just add the throughput information to the dict that is returned by the evaluator.
I like the idea of an evaluation report, however, I don't think we can assume to know e.g. the device the pipeline is running on: for now it is a transformers pipeline, but it could be any callable, so we would not know how to get that info. The evaluate.save function lets you store any information and by default also saves some system information. Maybe we could extend this and then let the user add whatever cannot be easily inferred (e.g. the device of the pipeline). What do you think?
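As a rough sketch of how that could look - the metric values are placeholders, and extra fields such as the device are supplied by the user since they cannot be inferred from an arbitrary callable:

import evaluate

results = {"accuracy": 0.91, "time_avg_ms": 12.3}  # placeholder values
# Writes a timestamped JSON file containing the passed fields plus some
# system information collected by evaluate.save
evaluate.save("./results/", **results, device="cuda:0", optimised=True)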
As for dummy inputs: I think this is something we should let the user handle. Maybe we can extend the docs with a dedicated "Evaluator" section and a "How to measure the performance of your pipeline" guide where we show best practices.
from evaluate.
Also @douwekiela proposed this in #23. The difficulty is that these numbers are hardware- and inference-setup-dependent, so the question is a bit what their value would be. Also, how would you calculate them with bootstrapping? Do you have a specific use case?
cc @ola13
from evaluate.
I like that use case! Even if the numbers are not directly comparable across systems, I do think people would appreciate the convenience (e.g. if I want to benchmark two models on the same system).
from evaluate.
I'm happy to look into this within the next 2 weeks; if someone feels like taking a stab at it, feel free to reassign :)
from evaluate.