Comments (6)
I appreciate the feedback! Right now, evaluation is done in one of four ways:

- Direct string comparison: this is the default behavior for anything you put in the `__expected` column.
- Basic JavaScript logic: using the `eval:` prefix, you can run string checks and keyword matches on output. For example: `eval: output.includes('foo')`. The test runner expects a piece of JavaScript code that returns a pass/fail boolean.
- Self-grading with LLM: using the `grade:` prefix, you can ask an LLM to evaluate the output against your criteria. For example: `grade: output contains a reference to a movie`. The test runner uses the provider specified in the `--grader` option.
- Human evaluation: the web UI helps facilitate thumbs-up/thumbs-down ratings. You can aggregate these ratings and pick the "best" prompt accordingly.
In short, keyword matching and exact overlap should be handled by case 1. Semantic similarity testing is a great suggestion. I'll look into this and see if I can get it added :)
See also: https://www.promptfoo.dev/docs/configuration/expected-outputs
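To make the prefix behavior concrete, here is a rough sketch of how a runner might dispatch on an `__expected` value. This is illustrative only, with hypothetical function names, not promptfoo's actual internals; the `grade:` branch is stubbed out since it would require an LLM provider.

```javascript
// Hypothetical dispatcher for the __expected prefixes described above.
function checkExpectation(expected, output) {
  if (expected.startsWith('eval:')) {
    // Run the user-supplied JavaScript snippet; it must return a boolean.
    const snippet = expected.slice('eval:'.length).trim();
    const fn = new Function('output', `return (${snippet});`);
    return Boolean(fn(output));
  }
  if (expected.startsWith('grade:')) {
    // The real tool would call the --grader LLM provider here; stubbed.
    throw new Error('grading requires an LLM provider');
  }
  // Default: direct string comparison.
  return expected === output;
}

console.log(checkExpectation("eval: output.includes('foo')", 'foobar')); // true
console.log(checkExpectation('hello', 'hello')); // true
console.log(checkExpectation('hello', 'world')); // false
```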
from promptfoo.
Support for semantic similarity is added in #7. When it lands, I'll deploy a new version of the library, 0.5.0.

It works like this:

- Semantic similarity: using the `similar` prefix, you can compare the semantic similarity of expected vs. actual output using OpenAI embeddings. For example, the directive `similar(0.8): hello world` will test that cosine similarity is >= 0.8 for test outputs.
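As a minimal sketch of the `similar(0.8)` check: compute cosine similarity between the two embedding vectors and compare it against the threshold. The embeddings are hard-coded here for illustration (in practice they come from OpenAI's embeddings API), and the helper names are hypothetical.

```javascript
// Cosine similarity between two equal-length number arrays.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Pass if similarity meets the threshold from similar(<threshold>).
function similarCheck(expectedEmbedding, outputEmbedding, threshold) {
  return cosineSimilarity(expectedEmbedding, outputEmbedding) >= threshold;
}

// Identical vectors have similarity 1, so they pass any threshold <= 1.
console.log(similarCheck([1, 2, 3], [1, 2, 3], 0.8)); // true
// Orthogonal vectors have similarity 0 and fail.
console.log(similarCheck([1, 0], [0, 1], 0.8)); // false
```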
Hope this helps!
Thanks for the suggestions, @MentalGear! I've simplified the GitHub README and pointed users toward the docs website, which is definitely easier to navigate.
Correction: link is https://github.com/squidgyai/squidgy-testy
Thank you for the super-swift reply, @typpo, and for planning #7!
With semantic similarity added, promptfoo should be among the very best open-source prompt-testing frameworks, if not the best! (and even on par with what commercial platforms like Vellum offer for testing)
Just one more quick suggestion: you might want to consolidate your documentation into one place. It's excellent on https://www.promptfoo.dev/docs/intro, which is also where, after a bit of digging, I found the evaluation methods you mentioned. But if you also keep a version with different content in the readme.md, it can be confusing, as there's no single source of truth (SSOT).
I would suggest keeping the intro and promo GIFs along with the icon grid in the readme, and adding a prominent link directly to the documentation at https://www.promptfoo.dev/docs/intro. :)
Thank you, and darn, that was quick! I just finished writing a blog post about open-source PT frameworks and already had to update it. 😅