
deepeval's Introduction


DeepEval is a simple-to-use, open-source LLM evaluation framework. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs on metrics such as G-Eval, hallucination, answer relevancy, RAGAS, and more, using LLMs and various other NLP models that run locally on your machine for evaluation.

Whether your application is implemented via RAG or fine-tuning, LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal hyperparameters to improve your RAG pipeline, prevent prompt drifting, or even transition from OpenAI to hosting your own Llama2 with confidence.

Want to talk LLM evaluation? Come join our Discord.


🔥 Metrics and Features

  • Large variety of ready-to-use LLM evaluation metrics (all with explanations) powered by ANY LLM of your choice, statistical methods, or NLP models that run locally on your machine:
    • G-Eval
    • Summarization
    • Answer Relevancy
    • Faithfulness
    • Contextual Recall
    • Contextual Precision
    • RAGAS
    • Hallucination
    • Toxicity
    • Bias
    • etc.
  • Evaluate your entire dataset in bulk in under 20 lines of Python code in parallel. Do this via the CLI in a Pytest-like manner, or through our evaluate() function.
  • Create your own custom metrics that are automatically integrated with DeepEval's ecosystem by inheriting DeepEval's base metric class (see the sketch below).
  • Integrates seamlessly with ANY CI/CD environment.
  • Easily benchmark ANY LLM on popular LLM benchmarks in under 10 lines of code, including:
    • MMLU
    • HellaSwag
    • DROP
    • BIG-Bench Hard
    • TruthfulQA
    • HumanEval
    • GSM8K
  • Automatically integrated with Confident AI for continuous evaluation throughout the lifetime of your LLM (app):
    • log evaluation results and analyze metric passes / fails
    • compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results
    • debug evaluation results via LLM traces
    • manage evaluation test cases / datasets in one place
    • track events to identify live LLM responses in production
    • real-time evaluation in production
    • add production events to existing evaluation datasets to strengthen evals over time

(Note that while some metrics are for RAG, others are better for a fine-tuning use case. Make sure to consult our docs to pick the right metric.)
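To make the custom-metric bullet above concrete, here is a minimal sketch of a metric that inherits DeepEval's base metric class. It assumes the interface described in the custom-metrics docs (a measure method, is_successful, and a __name__ property); depending on your DeepEval version you may also need an async a_measure. The KeywordCoverageMetric name and scoring logic are purely illustrative.

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class KeywordCoverageMetric(BaseMetric):
    """Toy custom metric: fraction of required keywords found in the output."""

    def __init__(self, keywords, threshold: float = 0.5):
        self.keywords = [k.lower() for k in keywords]
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        output = (test_case.actual_output or "").lower()
        hits = sum(1 for k in self.keywords if k in output)
        self.score = hits / len(self.keywords) if self.keywords else 0.0
        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Keyword Coverage"

Once defined, such a metric can be passed to assert_test or evaluate() alongside the built-in metrics.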


🔌 Integrations


🚀 QuickStart

Let's pretend your LLM application is a RAG-based customer support chatbot; here's how DeepEval can help you test what you've built.

Installation

pip install -U deepeval

Create an account (highly recommended)

Although optional, creating an account on our platform lets you log test results, making it easy to track changes and performance across iterations. You can run test cases without logging in, but we highly recommend giving it a try.

To log in, run:

deepeval login

Follow the instructions in the CLI to create an account, copy your API key, and paste it into the CLI. All test cases will automatically be logged (find more information on data privacy here).

Writing your first test case

Create a test file:

touch test_chatbot.py

Open test_chatbot.py and write your first test case using DeepEval:

import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_case():
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace this with the actual output from your LLM application
        actual_output="We offer a 30-day full refund at no extra costs.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
    )
    assert_test(test_case, [answer_relevancy_metric])

Set your OPENAI_API_KEY as an environment variable (you can also evaluate using your own custom model, for more details visit this part of our docs):

export OPENAI_API_KEY="..."

And finally, run test_chatbot.py in the CLI:

deepeval test run test_chatbot.py

Your test should have passed ✅ Let's break down what happened.

  • The variable input mimics user input, and actual_output is a placeholder for what your chatbot actually outputs based on this query.
  • The variable retrieval_context contains the relevant information retrieved from your knowledge base, and AnswerRelevancyMetric(threshold=0.5) is an out-of-the-box metric provided by DeepEval. It evaluates the relevancy of your LLM output based on the provided context.
  • The metric score ranges from 0 to 1. The threshold=0.5 value ultimately determines whether your test has passed or not.

Read our documentation for more information on how to use additional metrics, create your own custom metrics, and tutorials on how to integrate with other tools like LangChain and LlamaIndex.


Evaluating Without Pytest Integration

Alternatively, you can evaluate without Pytest, which is more suited for a notebook environment.

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
evaluate([test_case], [answer_relevancy_metric])

Using Standalone Metrics

DeepEval is extremely modular, making it easy for anyone to use any of our metrics. Continuing from the previous example:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
# Most metrics also offer an explanation
print(answer_relevancy_metric.reason)

Note that some metrics are for RAG pipelines, while others are for fine-tuning. Make sure to use our docs to pick the right one for your use case.

Evaluating a Dataset / Test Cases in Bulk

In DeepEval, a dataset is simply a collection of test cases. Here is how you can evaluate these in bulk:

import pytest
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

first_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
second_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])

dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])

@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    hallucination_metric = HallucinationMetric(threshold=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [hallucination_metric, answer_relevancy_metric])

Run this in the CLI (you can also add an optional -n flag to run tests in parallel):

deepeval test run test_<filename>.py -n 4

Alternatively, although we recommend using deepeval test run, you can evaluate a dataset/test cases without using our Pytest integration:

from deepeval import evaluate
...

evaluate(dataset, [answer_relevancy_metric])
# or
dataset.evaluate([answer_relevancy_metric])

Real-time Evaluations on Confident AI

We offer a free web platform for you to:

  1. Log and view all test results / metrics data from DeepEval's test runs.
  2. Debug evaluation results via LLM traces.
  3. Compare and pick the optimal hyperparameters (prompt templates, models, chunk size, etc.).
  4. Create, manage, and centralize your evaluation datasets.
  5. Track events in production and augment your evaluation dataset for continuous evaluation.
  6. View evaluation results and historical insights for production events.

Everything on Confident AI, including how to use it, is documented here.

To begin, login from the CLI:

deepeval login

Follow the instructions to log in, create your account, and paste your API key into the CLI.

Now, run your test file again:

deepeval test run test_chatbot.py

You should see a link displayed in the CLI once the test has finished running. Paste it into your browser to view the results!



Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.


Roadmap

Features:

  • Implement G-Eval
  • Referenceless Evaluation
  • Production Evaluation & Logging
  • Evaluation Dataset Creation

Integrations:

  • LlamaIndex
  • LangChain
  • Guidance
  • Guardrails
  • EmbedChain

Authors

Built by the founders of Confident AI. Contact [email protected] for all enquiries.


License

DeepEval is licensed under Apache 2.0 - see the LICENSE.md file for details.

deepeval's People

Contributors

agokrani, andrea23romano, andresprez, anindyadeep, bderenzi, colabdog, deeds67, donaldwasserman, elafo, fabian57fabian, j-space-b, ji21, kelp710, kritinv, krrishdholakia, kubre, lbux, mikkeyboi, navkar98, nictuku, pedroallenrevez, peilun-li, penguine-ip, philipchung, pratyush-exe, rohinish404, se-hun, shippy, vasilije1990, vmesel


deepeval's Issues

'Answer Length' metric

aka a 'Brevity' metric: a simple character count of the answer (this metric might be somewhere else in the code, but I'm not seeing it?). Some models could score higher on relevancy but with more words (in theory) - an 'answer length' metric would control for that.
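A rough sketch of what such a metric could compute; this is a hypothetical scorer along the lines the issue describes, not a shipped DeepEval metric:

def brevity_score(answer: str, budget: int = 200) -> float:
    # 1.0 while the answer stays within the character budget,
    # decaying toward 0 as it grows past it.
    length = len(answer)
    if length <= budget:
        return 1.0
    return max(0.0, 1.0 - (length - budget) / budget)

print(brevity_score("We offer a 30-day full refund at no extra costs."))  # 1.0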

Litellm integration request

I think it's a good idea to integrate with LiteLLM, as it provides an easy way to integrate with 100+ LLMs.

https://github.com/BerriAI/litellm

LLM as a drop in replacement for GPT. Use Azure, OpenAI, Cohere, Anthropic, Ollama, VLLM, Sagemaker, HuggingFace, Replicate (100+ LLMs)
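As a rough illustration of how such an integration could be used, LiteLLM's completion() (which mirrors the OpenAI chat format) could generate the actual_output for a DeepEval test case; the model name below is just an example:

from litellm import completion
from deepeval.test_case import LLMTestCase

question = "What if these shoes don't fit?"
response = completion(
    model="gpt-3.5-turbo",  # swap for any of the 100+ supported providers
    messages=[{"role": "user", "content": question}],
)
test_case = LLMTestCase(
    input=question,
    actual_output=response.choices[0].message.content,
)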

Parallelized Tests

Support running parallelized tests with pytest xdist.

Desired developer experience:

# Distribute the number of workers to 4
deepeval test run test_sample.py -num_workers 4

Motivation:
One problem we're running into is that it currently takes too long to run unit tests for LLMs, and it's a bad developer experience for engineers to wait 5-10 minutes for their test results.

For example, it takes around 30 seconds to generate an LLM output depending on the length of the response, so even running 10 tests (which isn't a lot) takes quite a while.

We're hoping to fix this with this issue.

Add Translation Similarity

Looking to potentially implement COMET's neural translation framework to compare against other AI companies.

Unbabel's COMET
https://github.com/Unbabel/COMET

This can be important to ensure consistent performance when changing models for different queries.

assert_translation_performance(...)
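For reference, a minimal sketch of scoring a translation with Unbabel's COMET package (pip install unbabel-comet); the checkpoint name follows COMET's own examples, and nothing here is an existing DeepEval API:

from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Dem Feuer konnte Einhalt geboten werden",
    "mt": "The fire could be stopped",
    "ref": "They were able to control the fire.",
}]
# In recent COMET versions, predict returns an object with per-sample scores.
prediction = model.predict(data, batch_size=8, gpus=0)
print(prediction.scores)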

Add LLMEvalMetric

Prompt to use for LLMEvalMetric

We provide a question and the 'ground-truth' answer. We also provide \
the predicted answer.

Evaluate whether the predicted answer is correct, given its similarity \
to the ground-truth. If details provided in predicted answer are reflected \
in the ground-truth answer, return "YES". To return "YES", the details don't \
need to exactly match. Be lenient in evaluation if the predicted answer \
is missing a few details. Try to make sure that there are no blatant mistakes. \
Otherwise, return "NO".

Question: {question}
Ground-truth Answer: {gt_answer}
Predicted Answer: {pred_answer}
Evaluation Result: \

As featured from this guide:

https://gpt-index.readthedocs.io/en/latest/examples/finetuning/knowledge/finetune_knowledge.html
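A hedged sketch of how the prompt above could be wired up as an LLM-as-judge check with the OpenAI client; the judge model and the YES/NO parsing are assumptions for illustration:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Condensed version of the evaluation prompt above.
EVAL_PROMPT = (
    "We provide a question and the 'ground-truth' answer. We also provide the "
    "predicted answer.\n\n"
    "Evaluate whether the predicted answer is correct, given its similarity to "
    "the ground-truth. If details provided in the predicted answer are reflected "
    'in the ground-truth answer, return "YES". Otherwise, return "NO".\n\n'
    "Question: {question}\n"
    "Ground-truth Answer: {gt_answer}\n"
    "Predicted Answer: {pred_answer}\n"
    "Evaluation Result: "
)

def llm_eval(question: str, gt_answer: str, pred_answer: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; swap for your own
        messages=[{"role": "user", "content": EVAL_PROMPT.format(
            question=question, gt_answer=gt_answer, pred_answer=pred_answer
        )}],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")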

Cohere Integration

A super valuable integration would be Cohere's Reranker integration. Will need to check if Cohere has anything useful that might be able to evaluate LLMs.
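As a loose sketch of what the reranker could contribute, Cohere's rerank endpoint scores retrieved chunks against a query, which could in principle feed a retrieval-quality check; the model name and usage below are assumptions based on Cohere's SDK, not an existing DeepEval integration:

import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")
results = co.rerank(
    model="rerank-english-v3.0",
    query="What if these shoes don't fit?",
    documents=[
        "All customers are eligible for a 30 day full refund at no extra costs.",
        "Our shoes are made from recycled materials.",
    ],
    top_n=2,
)
for r in results.results:
    print(r.index, r.relevance_score)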

error in command -> deepeval test run test_sample.py

Downloading models (may take up to 2 minutes if running for the first time)...Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Users\jayit\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1009, in _bootstrap_inner
self.run()
File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\live.py", line 32, in run
self.live.refresh()
File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\live.py", line 241, in refresh
with self.console:
File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\console.py", line 864, in exit
self._exit_buffer()
File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\console.py", line 822, in _exit_buffer
self._check_buffer()
File "C:\Users\jayit\deepeval\venv\lib\site-packages\rich\console.py", line 2038, in _check_buffer
write(text)
File "C:\Users\jayit\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u283c' in position 10: character maps to
*** You may need to add PYTHONIOENCODING=utf-8 to your environment ***
FF.s [100%]
============================================================ FAILURES =============================================================
_____________________________________________________________ test_1 ______________________________________________________________

def test_1():
    # Check to make sure it is relevant
    query = "What is the capital of France?"
    output = "The capital of France is Paris."
    metric = RandomMetric()
    # Comment this out for differne metrics/models
    # metric = AnswerRelevancyMetric(minimum_score=0.5)
    test_case = LLMTestCase(query=query, output=output)
  assert_test(test_case, [metric])

test_sample.py:18:


venv\lib\site-packages\deepeval\run_test.py:252: in assert_test
return run_test(
venv\lib\site-packages\deepeval\run_test.py:239: in run_test
measure_metric()
venv\lib\site-packages\deepeval\retry.py:39: in wrapper
raise last_error # Raise the last error
venv\lib\site-packages\deepeval\retry.py:23: in wrapper
result = func(*args, **kwargs)


@retry(
    max_retries=max_retries, delay=delay, min_success=min_success
)
def measure_metric():
    score = metric.measure(test_case)
    success = metric.is_successful()
    if isinstance(test_case, LLMTestCase):
        log(
            success=success,
            score=score,
            metric=metric,
            query=test_case.query if test_case.query else "-",
            output=test_case.output if test_case.output else "-",
            expected_output=test_case.expected_output
            if test_case.expected_output
            else "-",
            context=test_case.context if test_case.context else "-",
        )

        test_result = TestResult(
            success=success,
            score=score,
            metric_name=metric.__name__,
            query=test_case.query if test_case.query else "-",
            output=test_case.output if test_case.output else "-",
            expected_output=test_case.expected_output
            if test_case.expected_output
            else "-",
            metadata=None,
            context=test_case.context,
        )
    elif isinstance(test_case, SearchTestCase):
        log(
            success=success,
            score=score,
            metric=metric,
            query=test_case.query if test_case.query else "-",
            output=str(test_case.output_list)
            if test_case.output_list
            else "-",
            expected_output=str(test_case.golden_list)
            if test_case.golden_list
            else "-",
            context="-",
        )
        test_result = TestResult(
            success=success,
            score=score,
            metric_name=metric.__name__,
            query=test_case.query if test_case.query else "-",
            output=test_case.output_list
            if test_case.output_list
            else "-",
            expected_output=test_case.golden_list
            if test_case.golden_list
            else "-",
            metadata=None,
            context="-",
        )
    else:
        raise ValueError("TestCase not supported yet.")
    test_results.append(test_result)

    if raise_error:
      assert (
            metric.is_successful()
        ), f"{metric.__name__} failed. Score: {score}."

E AssertionError: Random failed. Score: 0.2304173247532807.

venv\lib\site-packages\deepeval\run_test.py:235: AssertionError
------------------------------------------------------ Captured stdout call -------------------------------------------------------
Attempt 1 failed: Random failed. Score: 0.2304173247532807.
Max retries (1) exceeded.
_____________________________________________________________ test_2 ______________________________________________________________

def test_2():
    # Check to make sure it is factually consistent
    output = "Cells have many major components, including the cell membrane, nucleus, mitochondria, and endoplasmic reticulum." 
    context = "Biology"
    metric = RandomMetric()
    # Comment this out for factual consistency tests
    # metric = FactualConsistencyMetric(minimum_score=0.8)
    test_case = LLMTestCase(output=output, context=context)
  assert_test(test_case, [metric])

test_sample.py:29:


venv\lib\site-packages\deepeval\run_test.py:252: in assert_test
return run_test(
venv\lib\site-packages\deepeval\run_test.py:239: in run_test
measure_metric()
venv\lib\site-packages\deepeval\retry.py:39: in wrapper
raise last_error # Raise the last error
venv\lib\site-packages\deepeval\retry.py:23: in wrapper
result = func(*args, **kwargs)


@retry(
    max_retries=max_retries, delay=delay, min_success=min_success
)
def measure_metric():
    score = metric.measure(test_case)
    success = metric.is_successful()
    if isinstance(test_case, LLMTestCase):
        log(
            success=success,
            score=score,
            metric=metric,
            query=test_case.query if test_case.query else "-",
            output=test_case.output if test_case.output else "-",
            expected_output=test_case.expected_output
            if test_case.expected_output
            else "-",
            context=test_case.context if test_case.context else "-",
        )

        test_result = TestResult(
            success=success,
            score=score,
            metric_name=metric.__name__,
            query=test_case.query if test_case.query else "-",
            output=test_case.output if test_case.output else "-",
            expected_output=test_case.expected_output
            if test_case.expected_output
            else "-",
            metadata=None,
            context=test_case.context,
        )
    elif isinstance(test_case, SearchTestCase):
        log(
            success=success,
            score=score,
            metric=metric,
            query=test_case.query if test_case.query else "-",
            output=str(test_case.output_list)
            if test_case.output_list
            else "-",
            expected_output=str(test_case.golden_list)
            if test_case.golden_list
            else "-",
            context="-",
        )
        test_result = TestResult(
            success=success,
            score=score,
            metric_name=metric.__name__,
            query=test_case.query if test_case.query else "-",
            output=test_case.output_list
            if test_case.output_list
            else "-",
            expected_output=test_case.golden_list
            if test_case.golden_list
            else "-",
            metadata=None,
            context="-",
        )
    else:
        raise ValueError("TestCase not supported yet.")
    test_results.append(test_result)

    if raise_error:
      assert (
            metric.is_successful()
        ), f"{metric.__name__} failed. Score: {score}."

E AssertionError: Random failed. Score: 0.2175566574653277.

venv\lib\site-packages\deepeval\run_test.py:235: AssertionError
------------------------------------------------------ Captured stdout call -------------------------------------------------------
Attempt 1 failed: Random failed. Score: 0.2175566574653277.
Max retries (1) exceeded.
====================================================== slowest 10 durations =======================================================
4.99s call test_sample.py::test_1
2.37s call test_sample.py::test_2
2.32s call test_sample.py::test_3

(7 durations < 0.005s hidden. Use -vv to show these durations.)
===================================================== short test summary info =====================================================
FAILED test_sample.py::test_1 - AssertionError: Random failed. Score: 0.2304173247532807.
FAILED test_sample.py::test_2 - AssertionError: Random failed. Score: 0.2175566574653277.
2 failed, 1 passed, 1 skipped in 10.94s
✅ Tests finished! View results on https://app.confident-ai.com/
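The traceback points at the Windows console's cp1252 encoding. As the log itself suggests, one workaround is to set PYTHONIOENCODING before running the tests (a standard Python environment variable, not a DeepEval-specific fix):

# Command Prompt
set PYTHONIOENCODING=utf-8

# PowerShell
$env:PYTHONIOENCODING = "utf-8"

deepeval test run test_sample.py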

Integration with HuggingFace Datasets

Integrate the EvaluationDataset class with HuggingFace datasets.

Developer experience should look something like:

EvaluationDataset.from_huggingface_dataset(...)
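Until such a helper exists, here is a minimal sketch of building an EvaluationDataset from a HuggingFace dataset by hand; the dataset name and column names below are placeholders:

from datasets import load_dataset
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

hf_dataset = load_dataset("your-org/your-eval-set", split="test")  # placeholder dataset

test_cases = [
    LLMTestCase(
        input=row["question"],
        actual_output=row["answer"],
        context=[row["context"]],
    )
    for row in hf_dataset
]
dataset = EvaluationDataset(test_cases=test_cases)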

HallucinationMetric

To check for hallucination, we can perform the following:

  • Grab sources from a Google Search/Query
  • Run Factual Consistency On Top

Improve Overall Score

For the Overall Score, implement a score breakdown to better understand what goes wrong with the score.

Add unit tests for CLI

As the CLI flow gets more and more ironed out - will need to add tests to ensure the developer onboarding flow doesn't break.
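A minimal sketch of what such a test could look like, assuming a pytest + subprocess smoke test against an environment where deepeval is installed:

import subprocess

def test_cli_entrypoint_runs():
    # Smoke test: the CLI should at least print its help text and exit cleanly.
    result = subprocess.run(["deepeval", "--help"], capture_output=True, text=True)
    assert result.returncode == 0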

Why are chunks required in FactualConsistencyMetric

Hey guys,

I just started exploring your great library today and was curious to understand the factual consistency metric.

Maybe I didn't get it right, but why do we have to create chunks of our context? It seems like they have no impact at all, since

scores = self.model.predict([(context, output), (output, context)])

is always called with context and output, hence producing the same scoring results in each loop. The max_score can already be found in the first loop iteration.

Code: deepeval/metrics/factual_consistency.py:19-32

def measure(self, output: str, context: str):
    context_list = chunk_text(context)
    max_score = 0
    for c in context_list:
        scores = self.model.predict([(context, output), (output, context)])
        print(scores)
        # https://huggingface.co/cross-encoder/nli-deberta-base
        # label_mapping = ["contradiction", "entailment", "neutral"]
        softmax_scores = softmax(scores)
        score = softmax_scores[0][1]
        if score > max_score:
            max_score = score

        second_score = softmax_scores[1][1]
        if second_score > max_score:
            max_score = second_score
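For clarity, a sketch of the change this issue seems to be pointing at, reusing the helpers from the snippet above: score each chunk c against the output instead of re-scoring the full context on every iteration. This is an illustration of the suggestion, not a drop-in patch:

def measure(self, output: str, context: str):
    context_list = chunk_text(context)
    max_score = 0
    for c in context_list:
        # Score the chunk (not the whole context) in both directions.
        scores = self.model.predict([(c, output), (output, c)])
        # label_mapping = ["contradiction", "entailment", "neutral"]
        softmax_scores = softmax(scores)
        max_score = max(max_score, softmax_scores[0][1], softmax_scores[1][1])
    return max_score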

Add tone similarity

The use case: businesses/enterprises often want to ensure that a generated sentence matches the tone in which a reference sentence was said. This check would be perfect for that.

Add a "not sure" / "unwilling to answer" metric

Measure how often an LLM is "unsure" of something or "unwilling" to answer. This is a growing pain point of LLMs. Research may not have caught up in this area yet, unfortunately, so some sort of conceptual similarity against a few such prompts should be the easiest way to do this.
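One way the conceptual-similarity idea could look in practice: compare the output against a handful of refusal templates with sentence embeddings. The sentence-transformers dependency, the model name, and the templates are assumptions for illustration, not anything DeepEval ships today:

from sentence_transformers import SentenceTransformer, util

REFUSAL_TEMPLATES = [
    "I'm not sure about that.",
    "I don't know the answer to that.",
    "I'm unable to help with that request.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")

def refusal_score(answer: str) -> float:
    # Highest cosine similarity between the answer and any refusal template.
    answer_emb = model.encode(answer, convert_to_tensor=True)
    template_embs = model.encode(REFUSAL_TEMPLATES, convert_to_tensor=True)
    return float(util.cos_sim(answer_emb, template_embs).max())

print(refusal_score("I'm sorry, I can't answer that."))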

Support new factual consistency model

Factual consistency can be significantly improved with a larger model - but this model can have issues when running in environments with limited GPU RAM. Add a section in the documentation on improving performance.

Versioning

Is your feature request related to a problem? Please describe.
If you are an engineer, it would be really important to be able to version and compare results as if it were a git branch

Describe the solution you'd like

git checkout -b feature/add-guidance

# To compare this against the most recent branch in terms of performance (which is saved/cached)
deepeval compare

# To compare the main branch
deepeval compare main 

Ideally it should then output a table to compare results

Add ChatGPT Synthetic Data Creation

Add way to create synthetic data with ChatGPT and create an EvaluationDataset class that allows you to process things in bulk.

dataset = create_query_answer_pairs(text="""Your content goes here""",)

Expected output would be the evaluation dataset will be filled with TestCase classes. A TestCase should have a query and expected_output. See deepeval/test_case.py

Pre-configuring Multiple Responses for a Conversation

Hello deepeval maintainers and community,

I am currently working on a project where I am building a chatbot to assist users in buying a product. I want to be able to evaluate the bot's responses in various conversation flows, and I was wondering if the deepeval library supports such a use case.

Here are the three specific flows I'd like to test:

  1. Positive Flow: The user agrees with everything and always responds positively, essentially saying 'yes' to all prompts.
  2. Inquisitive Flow: The user asks 1 to 3 out of 5 possible questions during the interaction.
  3. Human Representative Request: Throughout the conversation, the user consistently asks to speak to a human representative.

Ideally, I would like to set up a mock "user" (which could be another bot) to communicate with the bot we're aiming to send to production. This would simulate these three scenarios and allow us to test our bot's responses.

Questions:

  1. Does the deepeval library support this use case of pre-configuring multiple conversation flows?
  2. If yes, could you provide any pointers or documentation links on how to set it up?
  3. If no, do you have any plans in the future roadmap to support such functionality or do you know of any other tools/libraries that might assist with this?

Thank you for your time and looking forward to your response!
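For what it's worth, one way to approximate this today is to script each flow's user turns, collect the bot's replies, and evaluate them as ordinary test cases. The sketch below is hedged: chatbot_reply is a hypothetical stand-in for the production bot, and the flows are just examples:

from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

FLOWS = {
    "positive": ["I'd like to buy the product.", "Yes.", "Yes, go ahead."],
    "inquisitive": ["What does it cost?", "Does it ship internationally?"],
    "human_rep": ["Can I talk to a human?", "I really want a human representative."],
}

def chatbot_reply(history, user_message):
    raise NotImplementedError  # call your production chatbot here

test_cases = []
for flow_name, user_turns in FLOWS.items():
    history = []
    for user_message in user_turns:
        bot_message = chatbot_reply(history, user_message)
        test_cases.append(LLMTestCase(input=user_message, actual_output=bot_message))
        history += [user_message, bot_message]

dataset = EvaluationDataset(test_cases=test_cases)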

Improve LangChain Integration

With the improvements in the package, the LangChain guide will need to be updated to demonstrate its capabilities.

  • Adding a LangChain callback here (see the sketch below)
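A rough sketch of what such a callback could capture, using LangChain's classic BaseCallbackHandler interface; the handler name is hypothetical and this is not an existing integration:

from langchain.callbacks.base import BaseCallbackHandler
from deepeval.test_case import LLMTestCase

class DeepEvalCaptureHandler(BaseCallbackHandler):
    """Collects prompt/response pairs from a LangChain run as LLMTestCases."""

    def __init__(self):
        self.test_cases = []
        self._last_prompts = []

    def on_llm_start(self, serialized, prompts, **kwargs):
        self._last_prompts = prompts

    def on_llm_end(self, response, **kwargs):
        for prompt, generations in zip(self._last_prompts, response.generations):
            self.test_cases.append(
                LLMTestCase(input=prompt, actual_output=generations[0].text)
            )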

Add CLI for AutoEvals

deepeval auto-generate sample.txt

This would save questions and answers inside a CSV. It would generate questions for each line of the text file.
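A loose sketch of what the command could do under the hood; generate_question_answer is a hypothetical helper that would prompt an LLM, and the file names are placeholders:

import csv

def generate_question_answer(line: str):
    raise NotImplementedError  # prompt an LLM to write a Q&A pair about `line`

with open("sample.txt", encoding="utf-8") as src, \
        open("autoevals.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["question", "answer", "source_line"])
    for line in src:
        line = line.strip()
        if not line:
            continue
        question, answer = generate_question_answer(line)
        writer.writerow([question, answer, line])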

LiteLLM Docs

Hey guys,

I see some LiteLLM docs in this repo - curious, did y'all fork it? Totally cool if so, just wondering why fork vs. using the package?

Auto-create evaluation dataset and edge cases

Develop a way to automatically create an evaluation dataset with edge cases so that users don't have to write tests. Then make it super easy to run!

New Design Plan:

  • Prompt to generate tests - ensure to include edge cases and RAG performance
  • Function to run evaluation on the generated tests
  • Improve the functionality

Add SQL Evaluation

Is your feature request related to a problem? Please describe.
Add a way to evaluate SQL queries based on maximizing info gain while minimizing the number of rows for synthetic query generation. Minimizing the number of rows is important for

Describe the solution you'd like
Warning- this API is a WIP. Very open to suggestions.

from deepeval.sql import SQLEval
table = SQLEval.load_table(...)

Describe alternatives you've considered
Can't really see other alternatives for SQL tables right now.

Additional context
May require a bit of work around building SQL injection and also providing a frontend to make viewing the created table very simple.

Remove warnings

========================================================= warnings summary =========================================================
../../../../../opt/homebrew/lib/python3.11/site-packages/_pytest/config/__init__.py:1204
  /opt/homebrew/lib/python3.11/site-packages/_pytest/config/__init__.py:1204: PytestAssertRewriteWarning: Module already imported so
cannot be rewritten: deepeval
    self._mark_plugins_for_rewrite(hook)

../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:121
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:121: DeprecationWarning: pkg_resources is deprecated as an 
API
    warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning)

../../../../../opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2870
  /opt/homebrew/lib/python3.11/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to 
`pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See 
https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)
