Comments (6)
I appreciate the feedback! Right now, evaluation is done in one of four ways:

- Direct string comparison: this is the default behavior for anything you put in the `__expected` column.
- Basic JavaScript logic: using the `eval:` prefix, you can run string checks and keyword matches on output. For example: `eval: output.includes('foo')`. The test runner expects a piece of JavaScript code that returns a pass/fail boolean.
- Self-grading with LLM: using the `grade:` prefix, you can ask an LLM to evaluate the output against your criteria. For example: `grade: output contains a reference to a movie`. The test runner uses the provider specified in the `--grader` option.
- Human evaluation: the web UI helps facilitate thumbs-up/thumbs-down ratings. You can aggregate these ratings and pick the "best" prompt accordingly.
In short, keyword matching and exact overlap should be handled by case 1. Semantic similarity testing is a great suggestion. I'll look into this and see if I can get it added :)
See also: https://www.promptfoo.dev/docs/configuration/expected-outputs
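To make the prefix behavior concrete, here is a rough sketch of how a runner might dispatch on an `__expected` value. This is illustrative only, with hypothetical function names, not promptfoo's actual internals; the `grade:` branch is stubbed out since it would require an LLM provider.

```javascript
// Hypothetical dispatcher for the __expected prefixes described above.
function checkExpectation(expected, output) {
  if (expected.startsWith('eval:')) {
    // Run the user-supplied JavaScript snippet; it must return a boolean.
    const snippet = expected.slice('eval:'.length).trim();
    const fn = new Function('output', `return (${snippet});`);
    return Boolean(fn(output));
  }
  if (expected.startsWith('grade:')) {
    // The real tool would call the --grader LLM provider here; stubbed.
    throw new Error('grading requires an LLM provider');
  }
  // Default: direct string comparison.
  return expected === output;
}

console.log(checkExpectation("eval: output.includes('foo')", 'foobar')); // true
console.log(checkExpectation('hello', 'hello')); // true
console.log(checkExpectation('hello', 'world')); // false
```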
from promptfoo.
Support for semantic similarity is added in #7. When it lands, I'll deploy a new version of the library, 0.5.0.

It works like this:

- Semantic similarity: using the `similar` prefix, you can compare the semantic similarity of expected vs. actual output using OpenAI embeddings. For example, the directive `similar(0.8): hello world` will test that cosine similarity is >= 0.8 for test outputs.
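As a minimal sketch of the `similar(0.8)` check: compute cosine similarity between the two embedding vectors and compare it against the threshold. The embeddings are hard-coded here for illustration (in practice they come from OpenAI's embeddings API), and the helper names are hypothetical.

```javascript
// Cosine similarity between two equal-length number arrays.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Pass if similarity meets the threshold from similar(<threshold>).
function similarCheck(expectedEmbedding, outputEmbedding, threshold) {
  return cosineSimilarity(expectedEmbedding, outputEmbedding) >= threshold;
}

// Identical vectors have similarity 1, so they pass any threshold <= 1.
console.log(similarCheck([1, 2, 3], [1, 2, 3], 0.8)); // true
// Orthogonal vectors have similarity 0 and fail.
console.log(similarCheck([1, 0], [0, 1], 0.8)); // false
```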
Hope this helps!
Thanks for the suggestions, @MentalGear! I've simplified the GitHub README and pointed users toward the docs website, which is definitely easier to navigate.
Correction: link is https://github.com/squidgyai/squidgy-testy
Thank you for the super-swift reply, @typpo, and for planning #7!
With semantic similarity added, promptfoo should be among the very best open-source prompt-testing frameworks, if not the best! (and even on par with what commercial platforms like Vellum offer for testing)
Just one more quick suggestion: you might want to consolidate your documentation into one place. It's excellent on https://www.promptfoo.dev/docs/intro, which is also where, after a bit of digging, I found the evaluation methods you mentioned. But if you also keep a version with different content in the readme.md, it can be confusing, as there's no single source of truth (SSOT).
I would suggest keeping the intro and promo GIFs along with the icon grid in the readme, and adding a prominent link directly to the documentation at https://www.promptfoo.dev/docs/intro. :)
Thank you, and darn, that was quick! I just finished writing a blog post about open-source PT frameworks and already had to update it. 😅