
deepeval's Introduction


DeepEval is a simple-to-use, open-source LLM evaluation framework. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs on metrics such as G-Eval, hallucination, answer relevancy, RAGAS, and more, using LLMs and various other NLP models that run locally on your machine for evaluation.

Whether your application is implemented via RAG or fine-tuning, LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal hyperparameters to improve your RAG pipeline, prevent prompt drifting, or even transition from OpenAI to hosting your own Llama2 with confidence.

Want to talk LLM evaluation? Come join our Discord.


🔥 Metrics and Features

  • Large variety of ready-to-use LLM evaluation metrics (all with explanations) powered by ANY LLM of your choice, statistical methods, or NLP models that run locally on your machine:
    • G-Eval
    • Summarization
    • Answer Relevancy
    • Faithfulness
    • Contextual Recall
    • Contextual Precision
    • RAGAS
    • Hallucination
    • Toxicity
    • Bias
    • etc.
  • Evaluate your entire dataset in bulk in under 20 lines of Python code in parallel. Do this via the CLI in a Pytest-like manner, or through our evaluate() function.
  • Create your own custom metrics that are automatically integrated with DeepEval's ecosystem by inheriting DeepEval's base metric class (see the sketch below).
  • Integrates seamlessly with ANY CI/CD environment.
  • Easily benchmark ANY LLM on popular LLM benchmarks in under 10 lines of code, including:
    • MMLU
    • HellaSwag
    • DROP
    • BIG-Bench Hard
    • TruthfulQA
    • HumanEval
    • GSM8K
  • Automatically integrated with Confident AI for continuous evaluation throughout the lifetime of your LLM (app):
    • log evaluation results and analyze metric passes / fails
    • compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results
    • debug evaluation results via LLM traces
    • manage evaluation test cases / datasets in one place
    • track events to identify live LLM responses in production
    • real-time evaluation in production
    • add production events to existing evaluation datasets to strengthen evals over time

(Note that while some metrics are for RAG, others are better for a fine-tuning use case. Make sure to consult our docs to pick the right metric.)
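To make the custom-metric bullet above concrete, here is a minimal sketch of a metric that inherits DeepEval's base metric class. It assumes the interface described in the custom-metrics docs (a measure method, is_successful, and a __name__ property); depending on your DeepEval version you may also need an async a_measure. The KeywordCoverageMetric name and scoring logic are purely illustrative.

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class KeywordCoverageMetric(BaseMetric):
    """Toy custom metric: fraction of required keywords found in the output."""

    def __init__(self, keywords, threshold: float = 0.5):
        self.keywords = [k.lower() for k in keywords]
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        output = (test_case.actual_output or "").lower()
        hits = sum(1 for k in self.keywords if k in output)
        self.score = hits / len(self.keywords) if self.keywords else 0.0
        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Keyword Coverage"

Once defined, such a metric can be passed to assert_test or evaluate() alongside the built-in metrics.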


🔌 Integrations


🚀 QuickStart

Let's pretend your LLM application is a RAG-based customer support chatbot; here's how DeepEval can help you test what you've built.

Installation

pip install -U deepeval

Create an account (highly recommended)

Although optional, creating an account on our platform lets you log test results, making it easy to track changes and performance across iterations. You can run test cases without logging in, but we highly recommend giving it a try.

To log in, run:

deepeval login

Follow the instructions in the CLI to create an account, copy your API key, and paste it into the CLI. All test cases will automatically be logged (find more information on data privacy here).

Writing your first test case

Create a test file:

touch test_chatbot.py

Open test_chatbot.py and write your first test case using DeepEval:

import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_case():
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace this with the actual output from your LLM application
        actual_output="We offer a 30-day full refund at no extra costs.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
    )
    assert_test(test_case, [answer_relevancy_metric])

Set your OPENAI_API_KEY as an environment variable (you can also evaluate using your own custom model, for more details visit this part of our docs):

export OPENAI_API_KEY="..."

And finally, run test_chatbot.py in the CLI:

deepeval test run test_chatbot.py

Your test should have passed ✅ Let's break down what happened.

  • The variable input mimics user input, and actual_output is a placeholder for what your chatbot actually outputs based on this query.
  • The variable retrieval_context contains the relevant information retrieved from your knowledge base, and AnswerRelevancyMetric(threshold=0.5) is an out-of-the-box metric provided by DeepEval. It evaluates the relevancy of your LLM output based on the provided context.
  • The metric score ranges from 0 to 1. The threshold=0.5 value ultimately determines whether your test has passed or not.

Read our documentation for more information on how to use additional metrics, create your own custom metrics, and tutorials on how to integrate with other tools like LangChain and LlamaIndex.


Evaluating Without Pytest Integration

Alternatively, you can evaluate without Pytest, which is more suited for a notebook environment.

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
evaluate([test_case], [answer_relevancy_metric])

Using Standalone Metrics

DeepEval is extremely modular, making it easy for anyone to use any of our metrics. Continuing from the previous example:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
# Most metrics also offer an explanation
print(answer_relevancy_metric.reason)

Note that some metrics are for RAG pipelines, while others are for fine-tuning. Make sure to use our docs to pick the right one for your use case.

Evaluating a Dataset / Test Cases in Bulk

In DeepEval, a dataset is simply a collection of test cases. Here is how you can evaluate these in bulk:

import pytest
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

first_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
second_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])

dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])

@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    hallucination_metric = HallucinationMetric(threshold=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [hallucination_metric, answer_relevancy_metric])

Run this in the CLI (you can also add an optional -n flag to run tests in parallel):

deepeval test run test_<filename>.py -n 4

Alternatively, although we recommend using deepeval test run, you can evaluate a dataset/test cases without using our Pytest integration:

from deepeval import evaluate
...

evaluate(dataset, [answer_relevancy_metric])
# or
dataset.evaluate([answer_relevancy_metric])

Real-time Evaluations on Confident AI

We offer a free web platform for you to:

  1. Log and view all test results / metrics data from DeepEval's test runs.
  2. Debug evaluation results via LLM traces.
  3. Compare and pick the optimal hyperparameters (prompt templates, models, chunk size, etc.).
  4. Create, manage, and centralize your evaluation datasets.
  5. Track events in production and augment your evaluation dataset for continuous evaluation.
  6. View evaluation results and historical insights for production events.

Everything on Confident AI, including how to use it, is documented here.

To begin, login from the CLI:

deepeval login

Follow the instructions to log in, create your account, and paste your API key into the CLI.

Now, run your test file again:

deepeval test run test_chatbot.py

You should see a link displayed in the CLI once the test has finished running. Paste it into your browser to view the results!



Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.


Roadmap

Features:

  • Implement G-Eval
  • Referenceless Evaluation
  • Production Evaluation & Logging
  • Evaluation Dataset Creation

Integrations:

  • LlamaIndex
  • LangChain
  • Guidance
  • Guardrails
  • EmbedChain

Authors

Built by the founders of Confident AI. Contact [email protected] for all enquiries.


License

DeepEval is licensed under Apache 2.0 - see the LICENSE.md file for details.

deepeval's People

Contributors

agokrani, andrea23romano, andresprez, anindyadeep, bderenzi, colabdog, deeds67, donaldwasserman, elafo, fabian57fabian, j-space-b, ji21, kelp710, kritinv, krrishdholakia, kubre, lbux, mikkeyboi, navkar98, nictuku, pedroallenrevez, peilun-li, penguine-ip, philipchung, pratyush-exe, rohinish404, se-hun, shippy, vasilije1990, vmesel


deepeval's Issues

'Answer Length' metric

aka a 'Brevity' metric: a simple character count of the answer (this metric might be somewhere else in the code, but I'm not seeing it?). Some models could score higher on relevancy but with more words (in theory) - an 'answer length' metric would control for that.
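A rough sketch of what such a metric could compute; this is a hypothetical scorer along the lines the issue describes, not a shipped DeepEval metric:

def brevity_score(answer: str, budget: int = 200) -> float:
    # 1.0 while the answer stays within the character budget,
    # decaying toward 0 as it grows past it.
    length = len(answer)
    if length <= budget:
        return 1.0
    return max(0.0, 1.0 - (length - budget) / budget)

print(brevity_score("We offer a 30-day full refund at no extra costs."))  # 1.0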

Litellm integration request

I think it's a good idea to integrate with LiteLLM, as it provides an easy way to integrate with 100+ LLMs.

https://github.com/BerriAI/litellm

LLM as a drop in replacement for GPT. Use Azure, OpenAI, Cohere, Anthropic, Ollama, VLLM, Sagemaker, HuggingFace, Replicate (100+ LLMs)
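As a rough illustration of how such an integration could be used, LiteLLM's completion() (which mirrors the OpenAI chat format) could generate the actual_output for a DeepEval test case; the model name below is just an example:

from litellm import completion
from deepeval.test_case import LLMTestCase

question = "What if these shoes don't fit?"
response = completion(
    model="gpt-3.5-turbo",  # swap for any of the 100+ supported providers
    messages=[{"role": "user", "content": question}],
)
test_case = LLMTestCase(
    input=question,
    actual_output=response.choices[0].message.content,
)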

Parallelized Tests

Support running parallelized tests with pytest xdist.

Desired developer experience:

# Distribute the number of workers to 4
deepeval test run test_sample.py -num_workers 4

Motivation:
One problem we're running into is that it currently takes too long to run unit tests for LLMs, and it's a bad developer experience for engineers to wait 5-10 minutes for their test results.

For example, it takes around 30 seconds to generate an LLM output depending on the length of the response, so even running 10 tests (which isn't a lot) takes quite a while.

We're hoping to fix this with this issue.

Add Translation Similarity

Looking to potentially implement COMET's neural translation framework to compare against other AI companies.

Unbabel's COMET
https://github.com/Unbabel/COMET

This can be important to ensure consistent performance when changing models for different queries.

assert_translation_performance(...)
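For reference, a minimal sketch of scoring a translation with Unbabel's COMET package (pip install unbabel-comet); the checkpoint name follows COMET's own examples, and nothing here is an existing DeepEval API:

from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Dem Feuer konnte Einhalt geboten werden",
    "mt": "The fire could be stopped",
    "ref": "They were able to control the fire.",
}]
# In recent COMET versions, predict returns an object with per-sample scores.
prediction = model.predict(data, batch_size=8, gpus=0)
print(prediction.scores)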

Add LLMEvalMetric

Prompt to use for LLMEvalMetric

We provide a question and the 'ground-truth' answer. We also provide \
the predicted answer.

Evaluate whether the predicted answer is correct, given its similarity \
to the ground-truth. If details provided in predicted answer are reflected \
in the ground-truth answer, return "YES". To return "YES", the details don't \
need to exactly match. Be lenient in evaluation if the predicted answer \
is missing a few details. Try to make sure that there are no blatant mistakes. \
Otherwise, return "NO".

Question: {question}
Ground-truth Answer: {gt_answer}
Predicted Answer: {pred_answer}
Evaluation Result: \

As featured from this guide:

https://gpt-index.readthedocs.io/en/latest/examples/finetuning/knowledge/finetune_knowledge.html
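A hedged sketch of how the prompt above could be wired up as an LLM-as-judge check with the OpenAI client; the judge model and the YES/NO parsing are assumptions for illustration:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Condensed version of the evaluation prompt above.
EVAL_PROMPT = (
    "We provide a question and the 'ground-truth' answer. We also provide the "
    "predicted answer.\n\n"
    "Evaluate whether the predicted answer is correct, given its similarity to "
    "the ground-truth. If details provided in the predicted answer are reflected "
    'in the ground-truth answer, return "YES". Otherwise, return "NO".\n\n'
    "Question: {question}\n"
    "Ground-truth Answer: {gt_answer}\n"
    "Predicted Answer: {pred_answer}\n"
    "Evaluation Result: "
)

def llm_eval(question: str, gt_answer: str, pred_answer: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; swap for your own
        messages=[{"role": "user", "content": EVAL_PROMPT.format(
            question=question, gt_answer=gt_answer, pred_answer=pred_answer
        )}],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")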

Cohere Integration

A super valuable integration would be Cohere's Reranker integration. Will need to check if Cohere has anything useful that might be able to evaluate LLMs.
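As a loose sketch of what the reranker could contribute, Cohere's rerank endpoint scores retrieved chunks against a query, which could in principle feed a retrieval-quality check; the model name and usage below are assumptions based on Cohere's SDK, not an existing DeepEval integration:

import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")
results = co.rerank(
    model="rerank-english-v3.0",
    query="What if these shoes don't fit?",
    documents=[
        "All customers are eligible for a 30 day full refund at no extra costs.",
        "Our shoes are made from recycled materials.",
    ],
    top_n=2,
)
for r in results.results:
    print(r.index, r.relevance_score)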

error in command -> deepeval test run test_sample.py

Downloading models (may take up to 2 minutes if running for the first time)...Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Users\jayit\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1009, in _bootstrap_inner
self.run()
File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\live.py", line 32, in run
self.live.refresh()
File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\live.py", line 241, in refresh
with self.console:
File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\console.py", line 864, in exit
self._exit_buffer()
File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\console.py", line 822, in _exit_buffer
self._check_buffer()
File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\console.py", line 2038, in _check_buffer
write(text)
File "C:\Users\jayit\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u283c' in position 10: character maps to
*** You may need to add PYTHONIOENCODING=utf-8 to your environment ***
FF.s [100%]
============================================================ FAILURES =============================================================
_____________________________________________________________ test_1 ______________________________________________________________

def test_1():
    # Check to make sure it is relevant
    query = "What is the capital of France?"
    output = "The capital of France is Paris."
    metric = RandomMetric()
    # Comment this out for differne metrics/models
    # metric = AnswerRelevancyMetric(minimum_score=0.5)
    test_case = LLMTestCase(query=query, output=output)
  assert_test(test_case, [metric])

test_sample.py:18:


venv\lib\site-packages\deepeval\run_test.py:252: in assert_test
return run_test(
venv\lib\site-packages\deepeval\run_test.py:239: in run_test
measure_metric()
venv\lib\site-packages\deepeval\retry.py:39: in wrapper
raise last_error # Raise the last error
venv\lib\site-packages\deepeval\retry.py:23: in wrapper
result = func(*args, **kwargs)


@retry(
    max_retries=max_retries, delay=delay, min_success=min_success
)
def measure_metric():
    score = metric.measure(test_case)
    success = metric.is_successful()
    if isinstance(test_case, LLMTestCase):
        log(
            success=success,
            score=score,
            metric=metric,
            query=test_case.query if test_case.query else "-",
            output=test_case.output if test_case.output else "-",
            expected_output=test_case.expected_output
            if test_case.expected_output
            else "-",
            context=test_case.context if test_case.context else "-",
        )

        test_result = TestResult(
            success=success,
            score=score,
            metric_name=metric.__name__,
            query=test_case.query if test_case.query else "-",
            output=test_case.output if test_case.output else "-",
            expected_output=test_case.expected_output
            if test_case.expected_output
            else "-",
            metadata=None,
            context=test_case.context,
        )
    elif isinstance(test_case, SearchTestCase):
        log(
            success=success,
            score=score,
            metric=metric,
            query=test_case.query if test_case.query else "-",
            output=str(test_case.output_list)
            if test_case.output_list
            else "-",
            expected_output=str(test_case.golden_list)
            if test_case.golden_list
            else "-",
            context="-",
        )
        test_result = TestResult(
            success=success,
            score=score,
            metric_name=metric.__name__,
            query=test_case.query if test_case.query else "-",
            output=test_case.output_list
            if test_case.output_list
            else "-",
            expected_output=test_case.golden_list
            if test_case.golden_list
            else "-",
            metadata=None,
            context="-",
        )
    else:
        raise ValueError("TestCase not supported yet.")
    test_results.append(test_result)

    if raise_error:
      assert (
            metric.is_successful()
        ), f"{metric.__name__} failed. Score: {score}."

E AssertionError: Random failed. Score: 0.2304173247532807.

venv\lib\site-packages\deepeval\run_test.py:235: AssertionError
------------------------------------------------------ Captured stdout call -------------------------------------------------------
Attempt 1 failed: Random failed. Score: 0.2304173247532807.
Max retries (1) exceeded.
_____________________________________________________________ test_2 ______________________________________________________________

def test_2():
    # Check to make sure it is factually consistent
    output = "Cells have many major components, including the cell membrane, nucleus, mitochondria, and endoplasmic reticulum." 
    context = "Biology"
    metric = RandomMetric()
    # Comment this out for factual consistency tests
    # metric = FactualConsistencyMetric(minimum_score=0.8)
    test_case = LLMTestCase(output=output, context=context)
  assert_test(test_case, [metric])

test_sample.py:29:


venv\lib\site-packages\deepeval\run_test.py:252: in assert_test
return run_test(
venv\lib\site-packages\deepeval\run_test.py:239: in run_test
measure_metric()
venv\lib\site-packages\deepeval\retry.py:39: in wrapper
raise last_error # Raise the last error
venv\lib\site-packages\deepeval\retry.py:23: in wrapper
result = func(*args, **kwargs)


@retry(
    max_retries=max_retries, delay=delay, min_success=min_success
)
def measure_metric():
    score = metric.measure(test_case)
    success = metric.is_successful()
    if isinstance(test_case, LLMTestCase):
        log(
            success=success,
            score=score,
            metric=metric,
            query=test_case.query if test_case.query else "-",
            output=test_case.output if test_case.output else "-",
            expected_output=test_case.expected_output
            if test_case.expected_output
            else "-",
            context=test_case.context if test_case.context else "-",
        )

        test_result = TestResult(
            success=success,
            score=score,
            metric_name=metric.__name__,
            query=test_case.query if test_case.query else "-",
            output=test_case.output if test_case.output else "-",
            expected_output=test_case.expected_output
            if test_case.expected_output
            else "-",
            metadata=None,
            context=test_case.context,
        )
    elif isinstance(test_case, SearchTestCase):
        log(
            success=success,
            score=score,
            metric=metric,
            query=test_case.query if test_case.query else "-",
            output=str(test_case.output_list)
            if test_case.output_list
            else "-",
            expected_output=str(test_case.golden_list)
            if test_case.golden_list
            else "-",
            context="-",
        )
        test_result = TestResult(
            success=success,
            score=score,
            metric_name=metric.__name__,
            query=test_case.query if test_case.query else "-",
            output=test_case.output_list
            if test_case.output_list
            else "-",
            expected_output=test_case.golden_list
            if test_case.golden_list
            else "-",
            metadata=None,
            context="-",
        )
    else:
        raise ValueError("TestCase not supported yet.")
    test_results.append(test_result)

    if raise_error:
      assert (
            metric.is_successful()
        ), f"{metric.__name__} failed. Score: {score}."

E AssertionError: Random failed. Score: 0.2175566574653277.

venv\lib\site-packages\deepeval\run_test.py:235: AssertionError
------------------------------------------------------ Captured stdout call -------------------------------------------------------
Attempt 1 failed: Random failed. Score: 0.2175566574653277.
Max retries (1) exceeded.
====================================================== slowest 10 durations =======================================================
4.99s call test_sample.py::test_1
2.37s call test_sample.py::test_2
2.32s call test_sample.py::test_3

(7 durations < 0.005s hidden. Use -vv to show these durations.)
===================================================== short test summary info =====================================================
FAILED test_sample.py::test_1 - AssertionError: Random failed. Score: 0.2304173247532807.
FAILED test_sample.py::test_2 - AssertionError: Random failed. Score: 0.2175566574653277.
2 failed, 1 passed, 1 skipped in 10.94s
✅ Tests finished! View results on https://app.confident-ai.com/
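The traceback points at the Windows console's cp1252 encoding. As the log itself suggests, one workaround is to set PYTHONIOENCODING before running the tests (a standard Python environment variable, not a DeepEval-specific fix):

# Command Prompt
set PYTHONIOENCODING=utf-8

# PowerShell
$env:PYTHONIOENCODING = "utf-8"

deepeval test run test_sample.py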

Integration with HuggingFace Datasets

Integrate the EvaluationDataset class with HuggingFace datasets.

Developer experience should look something like:

EvaluationDataset.from_huggingface_dataset(...)
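Until such a helper exists, here is a minimal sketch of building an EvaluationDataset from a HuggingFace dataset by hand; the dataset name and column names below are placeholders:

from datasets import load_dataset
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

hf_dataset = load_dataset("your-org/your-eval-set", split="test")  # placeholder dataset

test_cases = [
    LLMTestCase(
        input=row["question"],
        actual_output=row["answer"],
        context=[row["context"]],
    )
    for row in hf_dataset
]
dataset = EvaluationDataset(test_cases=test_cases)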

HallucinationMetric

To check for hallucination, we can perform the following:

  • Grab sources from a Google Search/Query
  • Run Factual Consistency On Top

Improve Overall Score

For the Overall Score, implement a score breakdown to better understand what goes wrong with the score.

Add unit tests for CLI

As the CLI flow gets more and more ironed out - will need to add tests to ensure the developer onboarding flow doesn't break.
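A minimal sketch of what such a test could look like, assuming a pytest + subprocess smoke test against an environment where deepeval is installed:

import subprocess

def test_cli_entrypoint_runs():
    # Smoke test: the CLI should at least print its help text and exit cleanly.
    result = subprocess.run(["deepeval", "--help"], capture_output=True, text=True)
    assert result.returncode == 0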

Why are chunks required in FactualConsistencyMetric

Hey guys,

I just started exploring your great library today and was curious to understand the factual consistency metric.

Maybe I didn't get it right, but why do we have to create chunks of our context? It seems like they have no impact at all, since

scores = self.model.predict([(context, output), (output, context)])

is always called with context and output, hence producing the same scoring results in each loop. The max_score can already be found in the first loop iteration.

Code: deepeval/metrics/factual_consistency.py:19-32

def measure(self, output: str, context: str):
    context_list = chunk_text(context)
    max_score = 0
    for c in context_list:
        scores = self.model.predict([(context, output), (output, context)])
        print(scores)
        # https://huggingface.co/cross-encoder/nli-deberta-base
        # label_mapping = ["contradiction", "entailment", "neutral"]
        softmax_scores = softmax(scores)
        score = softmax_scores[0][1]
        if score > max_score:
            max_score = score

        second_score = softmax_scores[1][1]
        if second_score > max_score:
            max_score = second_score
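For clarity, a sketch of the change this issue seems to be pointing at, reusing the helpers from the snippet above: score each chunk c against the output instead of re-scoring the full context on every iteration. This is an illustration of the suggestion, not a drop-in patch:

def measure(self, output: str, context: str):
    context_list = chunk_text(context)
    max_score = 0
    for c in context_list:
        # Score the chunk (not the whole context) in both directions.
        scores = self.model.predict([(c, output), (output, c)])
        # label_mapping = ["contradiction", "entailment", "neutral"]
        softmax_scores = softmax(scores)
        max_score = max(max_score, softmax_scores[0][1], softmax_scores[1][1])
    return max_score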

Add tone similarity

The use case: businesses/enterprises often want to ensure that a generated sentence matches the tone in which a reference sentence was said. This check would be perfect for that.

Add a "not sure" / "unwilling to answer" metric

Measure how often an LLM is "unsure" of something or "unwilling" to answer. This is a growing pain point of LLMs. Research may not have caught up in this area yet, unfortunately, so some sort of conceptual similarity against a few such prompts should be the easiest way to do this.
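One way the conceptual-similarity idea could look in practice: compare the output against a handful of refusal templates with sentence embeddings. The sentence-transformers dependency, the model name, and the templates are assumptions for illustration, not anything DeepEval ships today:

from sentence_transformers import SentenceTransformer, util

REFUSAL_TEMPLATES = [
    "I'm not sure about that.",
    "I don't know the answer to that.",
    "I'm unable to help with that request.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")

def refusal_score(answer: str) -> float:
    # Highest cosine similarity between the answer and any refusal template.
    answer_emb = model.encode(answer, convert_to_tensor=True)
    template_embs = model.encode(REFUSAL_TEMPLATES, convert_to_tensor=True)
    return float(util.cos_sim(answer_emb, template_embs).max())

print(refusal_score("I'm sorry, I can't answer that."))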

Support new factual consistency model

Factual consistency can be significantly improved with a larger model - but this model can have issues when running in environments with limited GPU RAM. Add a section in the documentation on improving performance.

Versioning

Is your feature request related to a problem? Please describe.
If you are an engineer, it would be really important to be able to version and compare results as if it were a git branch

Describe the solution you'd like

git checkout -b feature/add-guidance

# To compare this against the most recent branch in terms of performance (which is saved/cached)
deepeval compare

# To compare the main branch
deepeval compare main 

Ideally it should then output a table to compare results

Add ChatGPT Synthetic Data Creation

Add way to create synthetic data with ChatGPT and create an EvaluationDataset class that allows you to process things in bulk.

dataset = create_query_answer_pairs(text="""Your content goes here""",)

Expected output would be the evaluation dataset will be filled with TestCase classes. A TestCase should have a query and expected_output. See deepeval/test_case.py

Pre-configuring Multiple Responses for a Conversation

Hello deepeval maintainers and community,

I am currently working on a project where I am building a chatbot to assist users in buying a product. I want to be able to evaluate the bot's responses in various conversation flows, and I was wondering if the deepeval library supports such a use case.

Here are the three specific flows I'd like to test:

  1. Positive Flow: The user agrees with everything and always responds positively, essentially saying 'yes' to all prompts.
  2. Inquisitive Flow: The user asks 1 to 3 out of 5 possible questions during the interaction.
  3. Human Representative Request: Throughout the conversation, the user consistently asks to speak to a human representative.

Ideally, I would like to set up a mock "user" (which could be another bot) to communicate with the bot we're aiming to send to production. This would simulate these three scenarios and allow us to test our bot's responses.

Questions:

  1. Does the deepeval library support this use case of pre-configuring multiple conversation flows?
  2. If yes, could you provide any pointers or documentation links on how to set it up?
  3. If no, do you have any plans in the future roadmap to support such functionality or do you know of any other tools/libraries that might assist with this?

Thank you for your time and looking forward to your response!
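For what it's worth, one way to approximate this today is to script each flow's user turns, collect the bot's replies, and evaluate them as ordinary test cases. The sketch below is hedged: chatbot_reply is a hypothetical stand-in for the production bot, and the flows are just examples:

from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

FLOWS = {
    "positive": ["I'd like to buy the product.", "Yes.", "Yes, go ahead."],
    "inquisitive": ["What does it cost?", "Does it ship internationally?"],
    "human_rep": ["Can I talk to a human?", "I really want a human representative."],
}

def chatbot_reply(history, user_message):
    raise NotImplementedError  # call your production chatbot here

test_cases = []
for flow_name, user_turns in FLOWS.items():
    history = []
    for user_message in user_turns:
        bot_message = chatbot_reply(history, user_message)
        test_cases.append(LLMTestCase(input=user_message, actual_output=bot_message))
        history += [user_message, bot_message]

dataset = EvaluationDataset(test_cases=test_cases)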

Improve LangChain Integration

With the improvements in the package, the LangChain guide will need to be updated to demonstrate its capabilities.

  • Adding a LangChain callback here (see the sketch below)
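A rough sketch of what such a callback could capture, using LangChain's classic BaseCallbackHandler interface; the handler name is hypothetical and this is not an existing integration:

from langchain.callbacks.base import BaseCallbackHandler
from deepeval.test_case import LLMTestCase

class DeepEvalCaptureHandler(BaseCallbackHandler):
    """Collects prompt/response pairs from a LangChain run as LLMTestCases."""

    def __init__(self):
        self.test_cases = []
        self._last_prompts = []

    def on_llm_start(self, serialized, prompts, **kwargs):
        self._last_prompts = prompts

    def on_llm_end(self, response, **kwargs):
        for prompt, generations in zip(self._last_prompts, response.generations):
            self.test_cases.append(
                LLMTestCase(input=prompt, actual_output=generations[0].text)
            )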

Add CLI for AutoEvals

deepeval auto-generate sample.txt

This would save questions and answers inside a CSV. It would generate questions for each line of the text file.
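A loose sketch of what the command could do under the hood; generate_question_answer is a hypothetical helper that would prompt an LLM, and the file names are placeholders:

import csv

def generate_question_answer(line: str):
    raise NotImplementedError  # prompt an LLM to write a Q&A pair about `line`

with open("sample.txt", encoding="utf-8") as src, \
        open("autoevals.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["question", "answer", "source_line"])
    for line in src:
        line = line.strip()
        if not line:
            continue
        question, answer = generate_question_answer(line)
        writer.writerow([question, answer, line])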

LiteLLM Docs

Hey guys,

I see some LiteLLM docs in this repo - curious, did y'all fork it? Totally cool if so, just wondering why fork vs. using the package?

Auto-create evaluation dataset and edge cases

Develop a way to automatically create an evaluation dataset with edge cases so that users don't have to write tests. Then make it super easy to run!

New Design Plan:

  • Prompt to generate tests - ensure to include edge cases and RAG performance
  • Function to run evaluation on the generated tests
  • Improve the functionality

Add SQL Evaluation

Is your feature request related to a problem? Please describe.
Add a way to evaluate SQL queries based on maximizing info gain while minimizing the number of rows for synthetic query generation. Minimizing the number of rows is important for

Describe the solution you'd like
Warning- this API is a WIP. Very open to suggestions.

from deepeval.sql import SQLEval
table = SQLEval.load_table(...)

Describe alternatives you've considered
Can't really see other alternatives for SQL tables right now.

Additional context
May require a bit of work around building SQL injection and also providing a frontend to make viewing the created table very simple.

Remove warnings

========================================================= warnings summary =========================================================
../../../../../opt/homebrew/lib/python3.11/site-packages/_pytest/config/__init__.py:1204
  /opt/homebrew/lib/python3.11/site-packages/_pytest/config/__init__.py:1204: PytestAssertRewriteWarning: Module already imported so
cannot be rewritten: deepeval
    self._mark_plugins_for_rewrite(hook)

../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:121
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:121: DeprecationWarning: pkg_resources is deprecated as an 
API
    warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning)

../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2870
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to 
`pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See 
https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)
