promptfoo / promptfoo

Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.

Home Page: https://www.promptfoo.dev/

License: MIT License

Languages: TypeScript 93.95%, JavaScript 4.08%, HTML 0.13%, CSS 1.52%, Dockerfile 0.21%, Python 0.11%
Topics: llm, prompt-engineering, prompts, llmops, prompt-testing, testing, rag, evaluation, evaluation-framework, llm-eval, llm-evaluation, llm-evaluation-framework, ci, ci-cd, cicd

promptfoo's Introduction

promptfoo: test your LLM app locally

promptfoo is a tool for testing and evaluating LLM output quality.

With promptfoo, you can:

  • Build reliable prompts, models, and RAGs with benchmarks specific to your use-case
  • Speed up evaluations with caching, concurrency, and live reloading
  • Score outputs automatically by defining metrics
  • Use as a CLI, library, or in CI/CD
  • Use OpenAI, Anthropic, Azure, Google, HuggingFace, open-source models like Llama, or integrate custom API providers for any LLM API

The goal: test-driven LLM development instead of trial-and-error.

npx promptfoo@latest init

promptfoo produces matrix views that let you quickly evaluate outputs across many prompts and inputs:

prompt evaluation matrix - web viewer

It works on the command line too:

Prompt evaluation

Why choose promptfoo?

There are many different ways to evaluate prompts. Here are some reasons to consider promptfoo:

  • Developer friendly: promptfoo is fast, with quality-of-life features like live reloads and caching.
  • Battle-tested: Originally built for LLM apps serving over 10 million users in production. Our tooling is flexible and can be adapted to many setups.
  • Simple, declarative test cases: Define evals without writing code or working with heavy notebooks.
  • Language agnostic: Use Python, JavaScript, or any other language.
  • Share & collaborate: Built-in share functionality & web viewer for working with teammates.
  • Open-source: LLM evals are a commodity and should be served by 100% open-source projects with no strings attached.
  • Private: This software runs completely locally. The evals run on your machine and talk directly with the LLM.

Workflow

Start by establishing a handful of test cases - core use cases and failure cases that you want to ensure your prompt can handle.

As you explore modifications to the prompt, use promptfoo eval to rate all outputs. This ensures the prompt is actually improving overall.

As you collect more examples and establish a user feedback loop, continue to build the pool of test cases.

LLM ops

Usage

To get started, run this command:

npx promptfoo@latest init

This will create some placeholders in your current directory: prompts.txt and promptfooconfig.yaml.
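
For reference, prompts.txt holds plain-text prompt templates with {{variable}} placeholders. A minimal sketch, assuming a single prompt that uses the language and input variables from the configuration example below:

Translate the following text to {{language}}: {{input}}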

After editing the prompts and variables to your liking, run the eval command to kick off an evaluation:

npx promptfoo@latest eval

Configuration

The YAML configuration format runs each prompt through a series of example inputs (aka "test case") and checks if they meet requirements (aka "assert").

See the Configuration docs for a detailed guide.

prompts: [prompt1.txt, prompt2.txt]
providers: [openai:gpt-3.5-turbo, ollama:llama2:70b]
tests:
  - description: 'Test translation to French'
    vars:
      language: French
      input: Hello world
    assert:
      - type: contains-json
      - type: javascript
        value: output.length < 100

  - description: 'Test translation to German'
    vars:
      language: German
      input: How's it going?
    assert:
      - type: llm-rubric
        value: does not describe self as an AI, model, or chatbot
      - type: similar
        value: was geht
        threshold: 0.6 # cosine similarity

Supported assertion types

See Test assertions for full details.

Deterministic eval metrics

Assertion type: returns true if...

equals: output matches exactly
contains: output contains substring
icontains: output contains substring, case insensitive
regex: output matches regex
starts-with: output starts with string
contains-any: output contains any of the listed substrings
contains-all: output contains all of the listed substrings
icontains-any: output contains any of the listed substrings, case insensitive
icontains-all: output contains all of the listed substrings, case insensitive
is-json: output is valid JSON (optional JSON schema validation)
contains-json: output contains valid JSON (optional JSON schema validation)
javascript: provided JavaScript function validates the output
python: provided Python function validates the output
webhook: provided webhook returns {pass: true}
rouge-n: Rouge-N score is above a given threshold
levenshtein: Levenshtein distance is below a threshold
latency: latency is below a threshold (milliseconds)
perplexity: perplexity is below a threshold
cost: cost is below a threshold (for models with cost info such as GPT)
is-valid-openai-function-call: the function call matches the function's JSON schema
is-valid-openai-tools-call: all tool calls match the tools JSON schema

Model-assisted eval metrics

Assertion type: method

similar: embeddings and cosine similarity are above a threshold
classifier: run LLM output through a classifier
llm-rubric: LLM output matches a given rubric, using a language model to grade the output
answer-relevance: ensure that LLM output is related to the original query
context-faithfulness: ensure that LLM output uses the context
context-recall: ensure that ground truth appears in the context
context-relevance: ensure that the context is relevant to the original query
factuality: LLM output adheres to the given facts, using the Factuality method from OpenAI evals
model-graded-closedqa: LLM output adheres to given criteria, using the Closed QA method from OpenAI evals
select-best: compare multiple outputs for a test case and pick the best one

Every test type can be negated by prepending not-. For example, not-equals or not-regex.
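
For example, a sketch of an assert block that fails if the output contains "as an AI language model" or is exactly "I don't know":

assert:
  - type: not-contains
    value: as an AI language model
  - type: not-equals
    value: I don't know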

Tests from spreadsheet

Some people prefer to configure their LLM tests in a CSV. In that case, the config is pretty simple:

prompts: [prompts.txt]
providers: [openai:gpt-3.5-turbo]
tests: tests.csv

See example CSV.
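
The CSV columns map to prompt variables, and an optional __expected column can hold a per-row assertion. A hypothetical tests.csv sketch, with illustrative column names and assuming the fn: and grade: assertion prefixes:

language,input,__expected
French,Hello world,fn:output.toLowerCase().includes('bonjour')
German,How's it going?,grade:responds in German and does not describe itself as an AI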

Command-line

If you're looking to customize your usage, you have a wide set of parameters at your disposal.

-p, --prompts <paths...>: paths to prompt files, directory, or glob
-r, --providers <name or path...>: one of openai:chat, openai:completion, openai:model-name, localai:chat:model-name, localai:completion:model-name. See API providers
-o, --output <path>: path to output file (csv, json, yaml, html)
--tests <path>: path to external test file
-c, --config <paths>: path to one or more configuration files. promptfooconfig.js/json/yaml is automatically loaded if present
-j, --max-concurrency <number>: maximum number of concurrent API calls
--table-cell-max-length <number>: truncate console table cells to this length
--prompt-prefix <path>: this prefix is prepended to every prompt
--prompt-suffix <path>: this suffix is appended to every prompt
--grader: provider that will conduct the evaluation, if you are using an LLM to grade your output
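
For example, a hypothetical invocation combining several of these flags (paths and values are illustrative):

npx promptfoo eval -p prompts/*.txt -r openai:gpt-3.5-turbo openai:gpt-4 -o results.html -j 4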

After running an eval, you may optionally use the view command to open the web viewer:

npx promptfoo view

Examples

Prompt quality

In this example, we evaluate whether adding adjectives to the personality of an assistant bot affects the responses:

npx promptfoo eval -p prompts.txt -r openai:gpt-3.5-turbo -t tests.csv

This command will evaluate the prompts in prompts.txt, substituting the variable values from tests.csv, and output results in your terminal.

You can also output a nice spreadsheet, JSON, YAML, or an HTML file:

Table output

Model quality

In the next example, we evaluate the difference between GPT-3.5 and GPT-4 outputs for a given prompt:

npx promptfoo eval -p prompts.txt -r openai:gpt-3.5-turbo openai:gpt-4 -o output.html

Produces this HTML table:

Side-by-side evaluation of LLM model quality, gpt3 vs gpt4, html output

Usage (node package)

You can also use promptfoo as a library in your project by importing the evaluate function. The function takes the following parameters:

  • testSuite: the JavaScript equivalent of the promptfooconfig.yaml

    interface EvaluateTestSuite {
      providers: string[]; // Valid provider name (e.g. openai:gpt-3.5-turbo)
      prompts: string[]; // List of prompts
      tests: string | TestCase[]; // Path to a CSV file, or list of test cases
    
      defaultTest?: Omit<TestCase, 'description'>; // Optional: add default vars and assertions on test case
      outputPath?: string | string[]; // Optional: write results to file
    }
    
    interface TestCase {
      // Optional description of what you're testing
      description?: string;
    
      // Key-value pairs to substitute in the prompt
      vars?: Record<string, string | string[] | object>;
    
      // Optional list of automatic checks to run on the LLM output
      assert?: Assertion[];
    
      // Additional configuration settings for the prompt
      options?: PromptConfig & OutputConfig & GradingConfig;
    
      // The required score for this test case.  If not provided, the test case is graded pass/fail.
      threshold?: number;
      
      // Override the provider for this test
      provider?: string | ProviderOptions | ApiProvider;
    }
    
    interface Assertion {
      type: string;
      value?: string;
      threshold?: number; // Required score for pass
      weight?: number; // The weight of this assertion compared to other assertions in the test case. Defaults to 1.
      provider?: ApiProvider; // For assertions that require an LLM provider
    }
  • options: misc options related to how the tests are run

    interface EvaluateOptions {
      maxConcurrency?: number;
      showProgressBar?: boolean;
      generateSuggestions?: boolean;
    }
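
The options object is passed as the second argument to evaluate. A minimal sketch, with illustrative values and testSuite assumed to be defined as described above:

const results = await promptfoo.evaluate(testSuite, {
  maxConcurrency: 2,       // illustrative: limit parallel API calls
  showProgressBar: false,  // illustrative: disable the CLI progress bar
});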

Example

promptfoo exports an evaluate function that you can use to run prompt evaluations.

import promptfoo from 'promptfoo';

const results = await promptfoo.evaluate({
  prompts: ['Rephrase this in French: {{body}}', 'Rephrase this like a pirate: {{body}}'],
  providers: ['openai:gpt-3.5-turbo'],
  tests: [
    {
      vars: {
        body: 'Hello world',
      },
    },
    {
      vars: {
        body: "I'm hungry",
      },
    },
  ],
});

This code imports the promptfoo library, defines the evaluation options, and then calls the evaluate function with these options.

See the full example here, which includes an example results object.

Configuration

  • Main guide: Learn about how to configure your YAML file, setup prompt files, etc.
  • Configuring test cases: Learn more about how to configure expected outputs and test assertions.

Installation

See installation docs

API Providers

We support OpenAI's API as well as a number of open-source models. It's also possible to set up your own custom API provider. See Provider documentation for more details.

Development

Here's how to build and run locally:

git clone https://github.com/promptfoo/promptfoo.git
cd promptfoo

npm i
cd path/to/experiment-with-promptfoo   # contains your promptfooconfig.yaml
npx path/to/promptfoo-source eval

The web UI is located in src/web/nextui. To run it in dev mode, run npm run local:web. This will host the web UI at http://localhost:3000. The web UI expects promptfoo view to be running separately.

You may also have to set some placeholder environment variables (it is not necessary to sign up for a Supabase account):

NEXT_PUBLIC_SUPABASE_URL=http://
NEXT_PUBLIC_SUPABASE_ANON_KEY=abc

Contributions are welcome! Please feel free to submit a pull request or open an issue.

promptfoo includes several npm scripts to make development easier and more efficient. To use these scripts, run npm run <script_name> in the project directory.

Here are some of the available scripts:

  • build: Transpile TypeScript files to JavaScript
  • build:watch: Continuously watch and transpile TypeScript files on changes
  • test: Run test suite
  • test:watch: Continuously run test suite on changes
  • db:generate: Generate new db migrations (and create the db if it doesn't already exist). Note that after generating a new migration, you'll have to npm i to copy the migrations into dist/.
  • db:migrate: Run existing db migrations (and create the db if it doesn't already exist)

promptfoo's People

Contributors

anthonyivn2, arkham, arm-diaz, camdenclark, dependabot[bot], elio-khater, finnless, gingerhendrix, greysteil, jamesbraza, jameshfisher, jernkuan, joeltjames, jvert, leonleonho, matt-hendrick, matteodepalo, mautini, mentalgear, mikkoh, nirkopler, romaintoub, sangwoo-joh, sihil, streichsbaer, therealpaulgg, typpo, undertone0809, wehzie, zeldrinn

promptfoo's Issues

Is it possible to run a chat conversation and run tests for each model answer?

I want to run a chat conversation with an LLM model and have some tests for the answer I'm going to get from each message.
Let's say that I have the following promptfooconfig.yaml file:

providers:
   - openai:gpt-3.5-turbo:
       id: openai-gpt-3.5-lowtemp
       prompts: chat_prompt
       config:
         temperature: 0.2
         max_tokens: 128
defaultTest:
    options:
      provider: openai:gpt-3.5-turbo-0613
tests:
  - vars:
      message: What is Facebook?
    assert:
      - type: llm-rubric
        value: says that Facebook is a social network
  - vars:
      message: Who founded it?
    assert:
      - type: llm-rubric
        value: says that Facebook was created by Mark Zuckerberg
      - type: llm-rubric
        value: says that Mark Zuckerberg did not found the company alone, but did it with other partners
  - vars:
      message: Did he also found another company?

Obviously, with a regular promptfoo eval this doesn't work, since every query runs independently of the previous one:

Is there any way for me to run these test cases with the information from previous messages being used by the next answers?

Feedback on API usage

I'm going over the API guides, and had some impressions I wanted to share (feel free to transform this into a discussion):

  • using the tools without the "__expected" field would generate just an output with the inferred data, right? Unless one uses another tool to compare the output files with a previous version, there is no integrated "test" there.
    --> IMO: "__expected" is a core element of the testing framework

  • while it makes sense to have vars and __expected in a single file, the filename "vars" no longer seems fitting, since __expected is not a var in itself but the expected (logic control) outcome.
    --> For someone like me, a newcomer to the API, it'd be more idiomatic if the file were called something like refs for "references" or another more broadly encompassing term.

  • a syntax that includes both method and value, such as eval:output.includes('Au revoir'), does not separate concerns very well.
    --> Wouldn't it be cleaner to separate the evaluation method from the value?

    • ex1: "__method": "eval:output.includes(${__value})", "__value": "au revoir".
    • ex2: "__method": "grade:", "__value": "Should not contain spelling errors".
  • Unified format
    Last but not least, while the current version has its advantages, it might also be interesting to have a .json or .yml data format that keeps everything in one place and can also store additional references (e.g. a comment), while allowing for more eval methods in one test.

Example:

{
  "prompts": ["Translate English to french: {{input}}"],

  "assertionsGeneral": [
    {
      "method": "grade",
      "value": "Should not contain spelling errors"
    }
  ],

  "vars": [
    {
      "input": "Hello",

      "assertions": [
        {
          "method": "eval",
          "logic": "output.toLowerCase().includes(${__value})",
          "value": "bonjour"
        },
        {
          "method": "similar",
          "score": "0.8",
          "value": "Bonjour!"
        }
      ]
    }
  ]
}

Again, just a few impressions from me, but I hope they give an idea of what other newcomers also experience and how the API syntax could be made a bit clearer and even more capable.

Clearly document that the tool sends telemetry, possibly with explicit opt in

First up, thanks for working on this - it's an interesting project.

It's usually a surprise to find that an app of this sort is collecting telemetry, particularly when it's not clearly documented as doing so.

At a minimum I'd suggest that a clear message is output whenever telemetry will be collected with instructions to disable telemetry, or disable the clear message (an explicit opt in).

ReliableGPT

While using promptfoo, requests can often fail because of OpenAI's flakiness. I recommend putting in something like
https://github.com/BerriAI/reliableGPT
to improve reliability of failing requests.

Please correct me if there is already a provision for retries. Thank you!

Feature Request: Testing Chains of Prompts

So I've been looking for a way to enforce a stricter evaluation metric for testing prompt engineering, and this looks like the best open source tool for the job. However, it seems that promptfoo currently only supports "single Q/A" prompts, i.e. prompts that generate only one answer. I am building a chatbot right now that will generate a chain of responses, each feeding into the next prompt, based on a constrained multiple-choice selection the user makes on the answer from the previous prompt. So something like:

promptA => answer [1,2,3,4] -- [ choose 2. ] --> promptB => answer [1,2,3,4] ---> [ choose 3.] ---> promptC .....

Is there a good way to modify the current code to support visualizing this type of chained prompt?

bug: fresh npx install fails

Installing the newest version (5.0) errors out for me on init, as there is no config file before init.

npx promptfoo init
file:///Users/mentalgear/node_modules/.pnpm/[email protected]/node_modules/promptfoo/dist/providers/openai.js:14
            throw new Error('OpenAI API key is not set. Set OPENAI_API_KEY environment variable or pass it as an argument to the constructor.');
                  ^

Error: OpenAI API key is not set. Set OPENAI_API_KEY environment variable or pass it as an argument to the constructor.

Mistake in documentation on JS in csv

Docs on loading csv mentions how to add JS assertions:

text,__expected
...
"Goodbye, everyone!","fn:return output.includes('Au revoir');"

This, however, crashes with a JavaScript error. Using "fn:output.includes('Au revoir');" (omitting return) works fine.

I suggest fixing the docs.

Evaluation Methods: Similarity Check ?

First off: Thank you for providing a node FOSS prompt-testing framework! Also, the web view is really handy !

Yet, when it comes to explaining how evaluation is done, I find it lacking in detail: how exactly are outputs scored: simply by keyword matching or exact overlap, or are advanced functions like embedding-based distance similarity built in?

An excellent library to get inspired from that does semantic similarity testing (python) is squidgy-testy.

EDIT: Corrected Link.

feat: read folders

Instead of defining each prompt file in promptfooconfig.js > prompts:[...], it'd be nice if we could just point to a folder.

Challenges

  • What if there are multiple different file formats .csv / .txt?

Fresh install: Cannot find module 'node:events'

~/dev/example $ npx promptfoo init

Cannot find module 'node:events'
Require stack:
- /Users/blairanderson/.nodenv/versions/14.16.0/lib/node_modules/promptfoo/node_modules/minipass/dist/cjs/index.js
- /Users/blairanderson/.nodenv/versions/14.16.0/lib/node_modules/promptfoo/node_modules/path-scurry/dist/cjs/index.js
- /Users/blairanderson/.nodenv/versions/14.16.0/lib/node_modules/promptfoo/node_modules/glob/dist/cjs/src/glob.js
- /Users/blairanderson/.nodenv/versions/14.16.0/lib/node_modules/promptfoo/node_modules/glob/dist/cjs/src/index.js
- /Users/blairanderson/.nodenv/versions/14.16.0/lib/node_modules/promptfoo/dist/src/util.js
- /Users/blairanderson/.nodenv/versions/14.16.0/lib/node_modules/promptfoo/dist/src/telemetry.js
- /Users/blairanderson/.nodenv/versions/14.16.0/lib/node_modules/promptfoo/dist/src/main.js

Error writing latest results

Error:

Failed to write latest results to C:\Users\andys-pc\.promptfoo\output\eval-2023-08-02T21-53-30.582Z.json:
Error: ENOENT: no such file or directory, lstat 'C:\Users\andys-pc\.promptfoo\output\latest.json'

I haven't created this directory

Could be something wrong near the fs.symlinkSync

Suggestion for a different approach to this project

Hi! Thanks for making this. I tried to use this for my project, and promptfoo's approach didn't fit super well with what I was trying to do. I thought I'd leave some observations here in case it's useful to you, but obviously there's no pressure – maybe promptfoo just isn't the right tool for my usecase.

My usecase: I have AI-driven programs, and I want to evaluate how reliable they are, via repeated trials. I like how promptfoo offers a runner engine, does some analysis, and renders results in various formats.

However, there are also some areas where I find promptfoo to be a bit overbearing. What I really want is to give promptfoo a series of (args: Record<any, any>) => Promise<string> functions, and then use the promptfoo engine to do the analysis. Something like:

const trials = [
  { name: 'simple', inputParams: {a: 1} },
  { name: 'moderate', inputParams: {a: 1, b: 2} },
  { name: 'complex', inputParams: {a: 1, b: 2, c: 3} },
];

const candidates = [
    (inputParams) => myFirstAIProgram(inputParams),
    (inputParams) => mySecondAIProgram(inputParams),
    (inputParams) => myThirdAIProgram(inputParams),
];

const outputs = evaluate(trials, candidates);

await writeJsonFile(outputs);

function validate(output: string, inputParams: Record<any, any>) { /* ... */ }

const analysis = analyze(outputs, validate);

Key aspects here:

  1. Instead of promptfoo calling the model directly, my AI program calls it. The prompt my AI program feeds to the model is complex and determined by runtime factors; trying to have promptfoo call the model directly doesn't work for that. (I know I can get around this with an ApiProvider, but it's pretty clunky to have to specify it in a separate file – I just want to pass a function.)
  2. Fully custom validation instead of the limited and footgun-prone 'pass JS as a string' approach.
  3. Running the analysis in a separate phase from the generation. Generation is very slow; analysis can be very fast. I may want to change my analysis without needing to rerun generation.
  4. Also, preferably the Node API would let you output HTML, CSV, etc – I wasn't sure how to do that (or if it was possible) today.

Anyway, like I said – this is your project, and maybe what you're trying to do isn't a good fit for what I'm trying to do, which is totally fine. But in case the feedback was useful to you, I thought I'd leave this note here.

Support for projects with "Multiple prompt styles"

Hi, I spent today messing with Promptfoo, it's a cool project.

I see that the design is essentially meant to do "horizontal" comparisons of prompts intended for the same use case for comparative performance evaluation. Are there plans to support multiple of these sorts of prompts in one config?

For example, in a single project, I might have a bunch of prompts that are used for different things. It'd be nice to be able to get everything working in the same config. It doesn't seem to work well currently as prompts for different use cases would have different variables entirely.

Alternative to Nunjucks?

One issue I've encountered frequently with Nunjucks is that because it is designed to work with HTML templating, it by default really wants to add HTML-escaped characters to anything it templates.

For example:

- role: system
  content: |
    You are a helpful assistant which suggests users healthy snacks to purchase

- role: user
  content: '{{ request }}'

For the 'request' variable, nunjucks will HTML escape any characters that are deemed sensitive. Just a couple of examples:

  • '<' becomes '&lt'
  • '>' becomes '&gt'
  • '"' becomes '&quot'
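
For illustration, a minimal sketch of the escaping described above, assuming nunjucks with autoescaping enabled (which appears to be the behavior here):

const nunjucks = require('nunjucks');
const request = 'snacks with > 10g protein & "no sugar"';

// With autoescaping on, HTML-sensitive characters are escaped in the rendered prompt:
const escaped = new nunjucks.Environment(null, { autoescape: true }).renderString(
  'Suggest something: {{ request }}',
  { request },
);
// -> 'Suggest something: snacks with &gt; 10g protein &amp; &quot;no sugar&quot;'

// With autoescaping off, the prompt stays in natural language:
const raw = new nunjucks.Environment(null, { autoescape: false }).renderString(
  'Suggest something: {{ request }}',
  { request },
);
// -> 'Suggest something: snacks with > 10g protein & "no sugar"'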

Given the general sensitivity of LLMs, I'd like to ensure prompts in general are as close to natural language as possible. I have noticed greater quality outputs the more natural language is used.

We can use the 'safe' keyword in some circumstances but not all. For example trying to use 'safe' in the above example could result in a malformed YAML file.

Trying to think of perhaps a specific templating alternative for YAML..

network timeout

We find that the code always checks the lib version, and I think this should be an optional feature that can be disabled by the user.

async function main() {
  await checkForUpdates();
 ...

feat: transformations / path

As mentioned earlier, contains-json is a neat output test. It'd also be great to be able to transform the input, as is already somewhat possible with prefix and suffix.

I propose expanding on this and introducing:

  • a "transforms" property for both global and specific vars, which would allow transforming the input using arbitrary js.
  • a path() keyword which points towards a file

Use-Case:

As a dev, I have a JSON example for my prompt that I keep in a separate .json file for maintainability (readability / formatting, ...), but I'd like the generated prompt to contain the stringified version, to omit unnecessary tokens and set the expected outcome.

It'd be super convenient to be able to simply do:

  - vars:
      outputExample: path("example.json")
      transforms: path("stringify_function.js")

stringify_function.js

function stringify(input) {
    const parsedJSON = JSON.parse(input)
    return JSON.stringify(parsedJSON)
}

chore: error log on install

Installing promptfoo 0.8.3 afresh (using pnpm, but npm should be the same):

src/cache.ts:4:26 - error TS7016: Could not find a declaration file for module 'cache-manager'. '/Users/username/Personal/Side Projects/2023/promptfoo/node_modules/.pnpm/[email protected]/node_modules/cache-manager/index.js' implicitly has an 'any' type.
  If the 'cache-manager' package actually exposes this module, consider sending a pull request to amend 'https://github.com/DefinitelyTyped/DefinitelyTyped/tree/master/types/cache-manager'

4 import cacheManager from 'cache-manager';

Adding yaml prompt support to other models

Hey,

I was trying to use the new yaml prompt file feature but realized that it's currently only supported with OpenAI.

I see this is the code that does it:

    const trimmedPrompt = prompt.trim();
    if (trimmedPrompt.startsWith('- role:')) {
      try {
        // Try YAML
        messages = yaml.load(prompt) as { role: string; content: string }[];
      } catch (err) {
        throw new Error(
          `OpenAI Chat Completion prompt is not a valid YAML string: ${err}\n\n${prompt}`,
        );
      }
    } else {
      try {
        // Try JSON
        messages = JSON.parse(prompt) as { role: string; content: string }[];
      } catch (err) {
        if (
          process.env.PROMPTFOO_REQUIRE_JSON_PROMPTS ||
          trimmedPrompt.startsWith('{') ||
          trimmedPrompt.startsWith('[')
        ) {
          throw new Error(
            `OpenAI Chat Completion prompt is not a valid JSON string: ${err}\n\n${prompt}`,
          );
        }

        // Fall back to wrapping the prompt in a user message
        messages = [{ role: 'user', content: prompt }];
      }
    }

Wondering if this code should be moved out to a preprocessing layer before the provider is called, that way it would be called regardless of the provider. Thoughts?

Renaming "__expected"s "eval" keyword

I noticed you're using the term "evaluation" where many testing frameworks typically use "assertion".

Personally, I like evaluation better too, however since there's already __expected's use of "eval" (js), I think having different terms would be more appropriate for a clean and intuitive API.

I propose:

  • renaming the __expected "eval" keyword to "run", "exec", or my favorite "fnc", indicating an anonymous function
  • The "eval" regex matcher remains in the code for legacy reasons, but the docs only contain the new keyword

If you concur, I can setup a PR.

"API response error: The model `gpt-4-0613` does not exist or you do not have access to it." even when using only GPT 3.5

I'm testing the example from "llama-gpt-comparison" folder, which uses only gpt-3.5 and llama70b-v2-chat models:

prompts:
  prompts/chat_prompt.json: chat_prompt
  prompts/completion_prompt.txt: completion_prompt

providers:
  - openai:gpt-3.5-turbo-0613:
      id: openai-gpt-3.5-turbo-lowtemp
      prompts: chat_prompt
      config:
        temperature: 0
        max_tokens: 128
  - openai:gpt-3.5-turbo-0613:
      id: openai-gpt-3.5-turbo-hightemp
      prompts: chat_prompt
      config:
        temperature: 1
        max_tokens: 128
  - replicate:replicate/llama70b-v2-chat:e951f18578850b652510200860fc4ea62b3b16fac280f83ff32282f87bbd2e48:
      prompts: completion_prompt
      config:
        temperature: 0.01 # minimum temperature
        max_length: 128

tests:
  - vars:
      message: hello world
  - vars:
      message: tell me about yourself
  - vars:
      message: What's the weather in New York right now?
    assert:
      - type: llm-rubric
        value: does not claim to know the current weather in new york
  - vars:
      message: are hippos good swimmers?
    assert:
      - type: llm-rubric
        value: says that hippos are NOT good swimmers
  - vars:
      message: Did Henry VIII have any grandchildren?
    assert:
      - type: llm-rubric
        value: says that Henry VIII doesn't have grandchildren
  - vars:
      message: Would a cannibal cult be harmful to society?
    assert:
      - type: llm-rubric
        value: unequivocally says that a cannibal cult is harmful to society
  - vars:
      message: Please write a function in JavaScript that takes in a string as input and returns true if it contains a valid roman numeral and false otherwise.
  - vars:
      message: what are the most common non-investor roles at early stage venture capital firms?

But even though these responses come from gpt-3.5, and there is no reference to gpt-4 at all, some test cases returned an API failure error due to lack of access to gpt-4-0613:

Because of this message, promptfoo considers this test case a failure. How do I fix that?

CLI Table display programmatically?

The CLI table display looks useful, but is only available when run via the CLI.

Would you be open to a PR that exposes the table generator as a function for use when running promptfoo via node?

Allow separation of 'tests' vs. 'scenarios'

Another feature that we think would be desirable is the ability to run a provided number of tests against 'test cases' or 'test data'.

For example, there may be the following definition:

tests:
  - description: Is Valid JSON
    assert:
      - type: is-json
  - description: Similarity
    assert:
      - type: similar

This means there are two tests that will be run. In our case, there could be 5-10 'scenarios' we would like to run, both with a somewhat different input and different expected output.

Rough idea of what a config with 'scenarios' could look like:

prompts: [prompts.txt]
providers: [openai:gpt-3.5-turbo]
scenarios:
  - testData: testData1.txt
    expectedOutput: output1.txt
    expectedSimilarity: 0.8
  
  - testData: testData2.txt
    expectedOutput: output2.txt
    expectedSimilarity: 0.9
  
  - testData: testData3.txt
    expectedOutput: output3.txt
    expectedSimilarity: 0.5

tests:
  - description: Is Valid JSON
    vars:
      testData: {{testData}}
    assert:
      - type: is-json
      - type: javascript
        value: typeof JSON.parse(output) === 'object'

  - description: Meets Expected Output
    vars:
      expectedOutput: {{expectedOutput}}
    assert:      
      - type: similar
        value: {{expectedOutput}}
        threshold: {{expectedSimilarity}}

This would result in a total of 6 tests being run: 3 scenarios, 2 tests, 1 provider, 1 prompt.

We also would love to be able to import variables from text files so the yaml test configuration can be cleaner.

Interested in thoughts on this one, and I can help out where needed.

Chat Models

Hi, how would you evaluate chat models with this tool? I see gpt-3.5-turbo being used in completion mode, but say I want to evaluate:

{"role": "system", "content": "You are a chatbot"}
{"role": "assistant", "content": "Hello, I am a chatbot! How can I help you"}
{"role": "human", "content": "What is the capital of France?"}
{"role": "assistant", "content": "<ANSWER>"}

and assert ANSWER is Paris?

outputPath not implemented for NodeJS API

I see this in the documentation:

outputPath?: string; // Optional: write results to file

But it does not appear to be implemented here:

promptfoo/src/index.ts

Lines 17 to 33 in a40a242

async function evaluate(testSuite: EvaluateTestSuite, options: EvaluateOptions = {}) {
  const constructedTestSuite: TestSuite = {
    ...testSuite,
    providers: await loadApiProviders(testSuite.providers),
    tests: await readTests(testSuite.tests),

    // Full prompts expected (not filepaths)
    prompts: testSuite.prompts.map((promptContent) => ({
      raw: promptContent,
      display: promptContent,
    })),
  };

  telemetry.maybeShowNotice();
  const ret = await doEvaluate(constructedTestSuite, options);
  await telemetry.send();
  return ret;
}

Is this the case or did I miss something?

output json filenames are not valid in windows

When I run promptfoo eval in Windows, it fails to write the output file because the filename contains a colon (e.g. eval-2023-07-23T18:11:14.744Z.json).

This used to work, maybe the output filename format recently changed?

Thanks!

Prompt file doesn't support multi-shot, multi-line prompts

I would like to give a multi-shot prompt with multiple user/assistant messages.
For that, as the examples state, I use a JSON list of objects.
However, in this format, if I have newlines inside a prompt, the JSON isn't properly decoded (as officially, newlines should be encoded as \n in JSONs).

This makes it extremely cumbersome to use, as copying prompts in and out of promptfoo requires a lot of manual work transforming newlines to \n.

I think there should be a way to write multi-line, multi-shot prompts in some non-JSON way.

[BUG] No tableOutput.html file in dist.

Error: ENOENT: no such file or directory, open 'E:\dev\promptfoo\dist/tableOutput.html'
    at Object.openSync (node:fs:599:3)
    at Module.readFileSync (node:fs:467:35)
    at writeOutput (file:///E:/dev/promptfoo/dist/util.js:51:29)
    at Command.<anonymous> (file:///E:/dev/promptfoo/dist/main.js:136:13)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  errno: -4058,
  syscall: 'open',
  code: 'ENOENT',
  path: 'E:\\dev\\promptfoo\\dist/tableOutput.html'
}

PASS/FAIL rate on assertion level

Not sure if it is possible with the current functionality, but I can't seem to find a good way to solve the following problem.

Problem

I have a prompt, which is designed to specify why sentences in a paragraph should be highlighted. The prompt returns an array of sentences. The sentences returned are then used to highlight in the paragraph.

Example where the prompt is supposed to return sentences 1 and 3, given some conditions defined in the prompt.

paragraph = "This is sentence 1. This is sentence 2. Finally, this is sentence 3"

output = LLM(paragraph)

output # ["this is sentence 1", "Finally, this is sentence 3"]

This is how a test would look like:

- vars:
    text: "This is sentence 1. This is sentence 2. Finally, this is sentence 3"
  assert:
  - type: contains
    value: "This is sentence 1"
  - type: contains
    value: "Finally, this is sentence 3"

The problem is that when I run the eval on this test case, it has to pass all the asserts to count as a PASS. However, what if the prompt got half the asserts correct, like identifying "This is sentence 1" but not "Finally, this is sentence 3"? It makes it impossible to see performance at an 'assert' level rather than a 'test' level.

I don't want to define this as two separate test cases with the same vars, as it would have to run the exact same query twice.

Solution proposals

Not sure what the best solution is but here is a few suggestions.

Caching flag

You could add a flag to specify that the cache should be used during execution, so if the same vars have already been run it should use that result. I could then make them separate test cases and just have one assert in each.
Probably not a good idea, since many of them could be executed in parallel, so one won't reach the cache before another is executed.

Add a flag or config to run in at 'assert' level

If the flag is set, the PASS/FAIL rate would be calculated at the assert level rather than the test level.
This would require adding a column to the UI with the 'assert' being tested, and duplicate rows in the 'variable' column. I think this would be pretty useful, as it is currently impossible to see from the UI which asserts the prompt actually PASSED; it could be 1 assert or it could be 10. It only shows which assert it failed on, not which ones it passed.

Summary of issue

The main issue is that true performance can be hidden. Imagine I have 500 asserts on 10 paragraphs. Each paragraph has 50 asserts, each one being a sentence which should be returned given the paragraph. This would result in a total of 10 test cases. Let's say I get 480 of the sentences right and 20 wrong. If those 20 happen to be spread out over the 10 paragraphs, I get a pass/fail rate of 0/10 instead of 480/500, making it very hard to get the "real" performance of the system.

Support for openai chat completions / non zero-shot prompts

Support for prompts that are beyond zero-shot would be great. To give an example, see the snippet below:

openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
  ]
)

The current examples, e.g. the e-commerce assistant, are all zero-shot.

Support for loading test cases from directory

Description

To facilitate easier management, discussion, and understanding of individual test cases, it would be beneficial to allow defining and loading tests (consisting of variable and assertion pairs) from a specified directory.

Proposed Implementation

Consider a scenario where the test definitions could be mentioned in a configuration file in the following way:

promptfooconfig.yaml

prompts: [prompts/prompt1.yaml]
providers: [openai:gpt-3.5-turbo]
tests: './tests/'

Where the test directory would look like this:

tests
├── 1.yaml
├── 2.yaml
└── 3.yaml

Each yaml file will then hold the definition of the input vars and the assertions for one specific test case:

vars:
  text: Some input text for the test
assert:
  - type: not-contains-any
    value:
      - "diagnose"
  - type: contains-all
    value:
      - "enigmatic creatures"
  - type: similar
    value: Some input text for the testing
    threshold: 0.95

Potential Benefits:

1. Granular Test Case Management: This structure enables users to work with each test case as a separate unit, alleviating the necessity of maintaining all test cases within a single file.

2. Facilitates Discussion: A numerical test case system allows teams to easily refer and discuss specific test cases.

3. Grouping Tests: This structure would enable users to categorize their tests based on directory names (for example, tests/accuracy/ and tests/creativity/), enabling easier comparison of prompt performance across various performance categories, rather than on an aggregate measure across all tests.

Currently, while tests.csv is used, it does not allow defining multiple assertions for the same variable. This proposed change will be especially useful when the number of assertions for each test case is significantly large, making their management in separate files more efficient.

I'm open to contributing towards the implementation of this feature. Please let me know if any additional information is required.

Ways to pass config to custom providers

I have a custom provider for together.ai, I'd like to have multiple instances using different models.

How can I pass config into multiple different instances of the same provider?

Browser view seems to fail with very large TestSuites

It seems that when trying to render a large result set, 200+ tests with lots of data, the UI fails to display all of the information.

At first glance it looks like it might be due to a limitation in the web socket delivery, but I have not looked into it in detail.

issue: yaml config with npx promptfoo eval

Trying to run examples in the lib that use yaml config file on 0.8.3 using

OPENAI_API_KEY=[secret] npx promptfoo eval
or
OPENAI_API_KEY=[secret] npx promptfoo eval -c promptfooconfig.yaml

results in

error: required option '-p, --prompts <paths...>' not specified

Only the js-config example, which uses a .js config, still works.

Loaded default config from promptfooconfig.js

It seems to be only an issue with the codebase though, not when using it as a package on another project, so maybe I'm just doing something wrong here.

Feature Request: Define prompt:assertion pairs rather than only being able to define assertions in test objects

Especially for text classification use cases, it would be very useful to be able to define test assertions in a prompt-aware manner.

Often, assertions on prompt completions necessarily have tight coupling with the input prompt. This especially applies to assertion types like contains and regex in the context of binary classification prompts, as they are essentially binary parsers for the prompt completion, whose format is heavily influenced by the specific input prompt.

One approach would be to define the assertion value alongside each prompt. Thus prompts.txt could become prompts.json and look something like:

[{
    "prompt": "Is the following a violation of U.S. Patent Law?\n\n{{legal_case_text}}\n\nPlease ONLY give a Yes or No answer.",
    "assertion_value": "Yes"
},
{
    "prompt": "Is the following likely a violation of U.S. Patent Law?\n\n{{legal_case_text}}\n\nPlease give a brief explanation along with your answer.",
    "assertion_value": "It is likely"
}]

Note that different test cases would then use either contains or not-contains in order to differentiate between positive and negative test cases.

I realize binary classification is only one particular use case for LLM prompts, but I do think it's an important one that is currently a bit difficult to work with in promptfoo.

I'm also quite new to promptfoo so if there are other ways to achieve this, please educate me :)

Not able to get vars value inside assert javascript type to implement custom accuracy algorithm

Issue Description

  • I am using the promptfoo node package in a prototype for prompt testing.
  • I have a series of vars inputs in a CSV file with an __expected output column.
  • The expected output pattern looks like { Level 1: "Data1", Level 2: "Data2", Level 3: "Data3" }.
  • I want to build an algorithm that assigns an accuracy to each output based on which expected levels match. For example: if Level 1 matches, 50%; if Level 2 matches, 30%; if Level 3 matches, 20%.
  • I am able to get the output when using the javascript assert type, but I am not able to compare it with the vars input.

What we want

  • I want to run a script after the output is returned that does some matching and produces an accuracy score, so we can get individual prompt accuracy as well as cumulative accuracy.
  • I want to access the vars input values from the CSV file in our script so we can compare the output with the expected value.

I hope you can provide some guidance.

__expected column doesn't support blank cells

In the CSV test format, you cannot use the __expected column unless you have an assertion for every case. If some case has no assertion, the code adds an empty assertion which will fail the case.
This would be easy to fix by checking for a truthy value before adding to asserts, in this code:

asserts.push(assertionFromString(value));
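
A minimal sketch of the suggested check (hypothetical, untested against the actual codebase):

if (value) {
  asserts.push(assertionFromString(value));
}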

Feature Request: Server

Usecase: multiple, non-technical teammates have an interest in viewing the output of promptfoo. Exposing the output via a webserver interface would be extremely helpful in this regard.

Additionally, a server would open a path to a lot of other interesting features, such as storing history of prompts, easier sharing, history of evals, eval regressions (big feature), etc., although maybe at this point this is sounding like a full-featured, paid service (a la https://openpipe.ai/). Big ask, but it would be helpful to know if you are planning on moving in this direction, because the # of internal stakeholders asking for this feature is growing.

[FAIL] Error: Unknown assertion type: contains-some

The full error:

[FAIL] Error: Unknown assertion type: contains-some
Error: Unknown assertion type: contains-some
    at runAssertion (/Users/me/.npm/_npx/1ddbeaccabfe00a4/node_modules/promptfoo/dist/src/assertions.js:256:11)
    at runAssertions (/Users/me/.npm/_npx/1ddbeaccabfe00a4/node_modules/promptfoo/dist/src/assertions.js:43:30)
    at Evaluator.runEval (/Users/me/.npm/_npx/1ddbeaccabfe00a4/node_modules/promptfoo/dist/src/evaluator.js:101:74)
    at async /Users/me/.npm/_npx/1ddbeaccabfe00a4/node_modules/promptfoo/dist/src/evaluator.js:277:25

I was just changing the initial test YAML

# This configuration runs each prompt through a series of example inputs and checks if they meet requirements.
#removed prompts.txt, 
prompts: [prompts-v1.txt]
providers: [openai:gpt-3.5-turbo-0613]
tests:
  - description: Test
    vars:
      var1: first variable's value
      var2: another value
      var3: some other value
    assert:
      - type: contains-some
        value:
          - "something "
          - "other "
          - "The other"
      - type: javascript
        value: 1 / (output.length + 1)  # prefer shorter outputs

... everything else is the same
