
Tonic Validate

A high-performance LLM/RAG evaluation framework
Explore the docs »
Prepare your unstructured data for RAG »

Report Bug · Request Feature · Quick Start

Table of Contents
  1. About The Project
  2. Quick Start
  3. CI/CD
  4. Usage
  5. FAQ
  6. Contributing
  7. License
  8. Contact

About The Project

Tonic Validate is a framework for evaluating the outputs of LLM systems such as Retrieval Augmented Generation (RAG) pipelines. Validate makes it easy to evaluate, track, and monitor your LLM and RAG applications. It provides metrics that measure everything from answer correctness to LLM hallucination. Additionally, Validate has an optional UI to visualize your evaluation results for easy tracking and monitoring.

Good RAG systems start with good inputs. Are you blocked on pre-processing messy, complex unstructured data into a standardized format for embedding and ingestion into your vector database?

Tonic Textual is a privacy-focused ETL for LLMs that standardizes unstructured data for AI development and uses proprietary NER models to create metadata tags that enable improved retrieval performance via metadata filtering. If you're spending too much time on data preparation, we can help; reach out to us for a demo.

(back to top)

Quick Start

Follow these steps to install Tonic Validate and score your first LLM response.

  1. Install Tonic Validate

    pip install tonic-validate
    
  2. Use the following code snippet to get started.

    from tonic_validate import ValidateScorer, Benchmark
    import os
    
    os.environ["OPENAI_API_KEY"] = "your-openai-key"
    
    # Function to simulate getting a response and context from your LLM
    # Replace this with your actual function call
    def get_llm_response(question):
        return {
            "llm_answer": "Paris",
            "llm_context_list": ["Paris is the capital of France."]
        }
    
    benchmark = Benchmark(questions=["What is the capital of France?"], answers=["Paris"])
    # Score the responses for each question and answer pair
    scorer = ValidateScorer()
    run = scorer.score(benchmark, get_llm_response)

This code snippet creates a benchmark with one question and reference answer and then scores the response. Providing a reference answer is not required for most metrics (see the Metrics table below).

(back to top)

CI/CD

Many users find value in running evaluations during the code review/pull request process. You can build your own automation using the snippet above together with our documentation and this README, or you can take advantage of our free GitHub Action in the GitHub Marketplace. The listing is here. It's easy to set up, but if you have any questions, just create an issue in the corresponding repository.
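
For example, a CI job can run the scorer and fail the build when a score drops below a target. The snippet below is a minimal sketch of that idea; the get_llm_response callback and the 0.8 threshold are placeholders you would replace with your own pipeline and target.

import sys
from tonic_validate import ValidateScorer, Benchmark

# Placeholder callback: replace with a call into your real RAG pipeline
def get_llm_response(question):
    return {
        "llm_answer": "Paris",
        "llm_context_list": ["Paris is the capital of France."]
    }

benchmark = Benchmark(questions=["What is the capital of France?"], answers=["Paris"])
scorer = ValidateScorer()
run = scorer.score(benchmark, get_llm_response)

# Fail the CI job if any overall score falls below the (placeholder) threshold
failing = {name: score for name, score in run.overall_scores.items()
           if score is not None and score < 0.8}
if failing:
    print("Metrics below threshold:", failing)
    sys.exit(1)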

(back to top)

Usage

Tonic Validate Metrics

Metrics are used to score your LLM's performance. Validate ships with many different metrics which are applicable to most RAG systems. You can also create your own metrics by providing your own implementation of metric.py. To compute a metric, you must provide it data from your RAG application. The table below shows a few of the many metrics we offer with Tonic Validate. For more detailed explanations of our metrics, refer to our documentation.

Metric Name             | Inputs                                   | Score Range | What does it measure?
Answer similarity score | Question, Reference answer, LLM answer   | 0 to 5      | How well the reference answer matches the LLM answer.
Retrieval precision     | Question, Retrieved context              | 0 to 1      | Whether the context retrieved is relevant to answer the given question.
Augmentation precision  | Question, Retrieved context, LLM answer  | 0 to 1      | Whether the relevant context is in the LLM answer.
Augmentation accuracy   | Retrieved context, LLM answer            | 0 to 1      | Whether all the context is in the LLM answer.
Answer consistency      | Retrieved context, LLM answer            | 0 to 1      | Whether there is information in the LLM answer that does not come from the context.
Latency                 | Run time                                 | 0 or 1      | Measures how long it takes for the LLM to complete a request.
Contains text           | LLM answer                               | 0 or 1      | Checks whether or not the response contains the given text.
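
If you only care about a subset of these metrics, you can pass the corresponding metric classes into ValidateScorer (covered in more detail later in this README). The snippet below is a small sketch; this particular mix of metrics is just an illustration.

from tonic_validate import ValidateScorer
from tonic_validate.metrics import (
    RetrievalPrecisionMetric,
    AugmentationPrecisionMetric,
    AnswerConsistencyMetric,
    LatencyMetric
)

# Score only the metrics you care about for this run
scorer = ValidateScorer([
    RetrievalPrecisionMetric(),
    AugmentationPrecisionMetric(),
    AnswerConsistencyMetric(),
    LatencyMetric()
])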

Metric Inputs

Metric inputs in Tonic Validate are used to provide the metrics with the information they need to calculate performance. Below, we explain each input type and how to pass them into Tonic Validate's SDK.

Question

What is it: The question asked
How to use: You can provide the questions by passing them into the Benchmark via the questions argument.

from tonic_validate import Benchmark
benchmark = Benchmark(
    questions=["What is the capital of France?", "What is the capital of Germany?"]
)

Reference Answer

What is it: A prewritten answer that serves as the ground truth for how the RAG application should answer the question.
How to use: You can provide the reference answers by passing them into the Benchmark via the answers argument. Each reference answer must correspond to a given question. So if the reference answer is for the third question in the questions list, then the reference answer must also be the third item in the answers list. The only metric that requires a reference answer is the Answer Similarity Score.

from tonic_validate import Benchmark
benchmark = Benchmark(
    questions=["What is the capital of France?", "What is the capital of Germany?"]
    answers=["Paris", "Berlin"]
)

LLM Answer

What is it: The answer the RAG application / LLM gives to the question.
How to use: You can provide the LLM answer via the callback you provide to the Validate scorer. The answer is returned from the callback via the llm_answer key in the response.

# Function to simulate getting a response and context from your LLM
# Replace this with your actual function call
def get_rag_response(question):
    return {
        "llm_answer": "Paris",
        "llm_context_list": ["Paris is the capital of France."]
    }

# Score the responses
scorer = ValidateScorer()
run = scorer.score(benchmark, get_rag_response)

If you are manually logging the answers without using the callback, then you can provide the LLM answer via llm_answer when creating the LLMResponse.

from tonic_validate import LLMResponse
# Save the responses into an array for scoring
responses = []
for item in benchmark:
    # llm_answer is the answer that LLM gives
    llm_response = LLMResponse(
        llm_answer="Paris",
        benchmark_item=item
    )
    responses.append(llm_response)

# Score the responses
scorer = ValidateScorer()
run = scorer.score_responses(responses)

Retrieved Context

What is it: The context that your RAG application retrieves when answering a given question. This context is what's put in the prompt by the RAG application to help the LLM answer the question.
How to use: You can provide the LLM's retrieved context list via the callback you provide to the Validate scorer. The retrieved context is returned from the callback via the llm_context_list key in the response and is always a list of strings.

# Function to simulate getting a response and context from your LLM
# Replace this with your actual function call
def get_rag_response(question):
    return {
        "llm_answer": "Paris",
        "llm_context_list": ["Paris is the capital of France."]
    }

# Score the responses
scorer = ValidateScorer()
run = scorer.score(benchmark, get_rag_response)

If you are manually logging the answers, then you can provide the LLM context via llm_context_list when creating the LLMResponse.

from tonic_validate import LLMResponse
# Save the responses into an array for scoring
responses = []
for item in benchmark:
    # llm_answer is the answer that LLM gives
    # llm_context_list is a list of the context that the LLM used to answer the question
    llm_response = LLMResponse(
        llm_answer="Paris",
        llm_context_list=["Paris is the capital of France."],
        benchmark_item=item
    )
    responses.append(llm_response)

# Score the responses
scorer = ValidateScorer()
run = scorer.score_responses(responses)

Run Time

What is it: Used for the latency metric to measure how long it took the LLM to respond.
How to use: If you are using the Validate scorer callback, then this metric is automatically calculated for you. If you are manually creating the LLM responses, then you need to provide how long the LLM took yourself via the run_time argument.

from tonic_validate import LLMResponse

# Save the responses into an array for scoring
responses = []
for item in benchmark:
    run_time = 1.5  # Replace with a float: how many seconds the LLM took to respond
    # llm_answer is the answer that LLM gives
    # llm_context_list is a list of the context that the LLM used to answer the question
    llm_response = LLMResponse(
        llm_answer="Paris",
        llm_context_list=["Paris is the capital of France."],
        benchmark_item=item,
        run_time=run_time
    )
    responses.append(llm_response)

# Score the responses
scorer = ValidateScorer()
run = scorer.score_responses(responses)

(back to top)

Scoring With Metrics

Most metrics are scored with the assistance of an LLM. Validate supports OpenAI, Azure OpenAI, Google Gemini, and Anthropic Claude; other LLMs can easily be integrated (just file a GitHub issue against this repository).

Important: Setting up OpenAI Key for Scoring

In order to use OpenAI you must provide an OpenAI API Key.

import os
os.environ["OPENAI_API_KEY"] = "put-your-openai-api-key-here"

If you already have the OPENAI_API_KEY set in your system's environment variables then you can skip this step. Otherwise, please set the environment variable before proceeding.

Using Azure

If you are using Azure, instead of setting the OPENAI_API_KEY environment variable, you instead need to set AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT. AZURE_OPENAI_ENDPOINT is the endpoint url for your Azure OpenAI deployment and AZURE_OPENAI_API_KEY is your API key.

import os
os.environ["AZURE_OPENAI_API_KEY"] = "put-your-azure-openai-api-key-here"
os.environ["AZURE_OPENAI_ENDPOINT"] = "put-your-azure-endpoint-here"

Using Gemini

If you already have the GEMINI_API_KEY set in your system's environment variables then you can skip this step. Otherwise, please set the environment variable before proceeding.

import os
os.environ["GEMINI_API_KEY"] = "put-your-gemini-api-key-here"

Note that to use Gemini, your Python version must be 3.9 or higher.

Using Claude

If you already have the ANTHROPIC_API_KEY set in your system's environment variables then you can skip this step. Otherwise, please set the environment variable before proceeding.

import os
os.environ["ANTHROPIC_API_KEY"] = "put-your-anthropic-api-key-here"

Setting up the Tonic Validate Scorer

To use metrics, instantiate an instance of ValidateScorer.

from tonic_validate import ValidateScorer
scorer = ValidateScorer()

The default model used for scoring metrics is GPT-4 Turbo. To change the OpenAI model, pass the OpenAI model name into the model_evaluator argument for ValidateScorer. You can also pass in specific metrics via a list of metrics.

from tonic_validate import ValidateScorer
from tonic_validate.metrics import AnswerConsistencyMetric, AugmentationAccuracyMetric

scorer = ValidateScorer([
    AnswerConsistencyMetric(),
    AugmentationAccuracyMetric()
], model_evaluator="gpt-3.5-turbo")

You can also pass in other models like Google Gemini or Claude by setting the model_evaluator argument to the model name like so

scorer = ValidateScorer(model_evaluator="gemini/gemini-1.5-pro-latest")
scorer = ValidateScorer(model_evaluator="claude-3")

If an error occurs while scoring an item's metric, the score for that metric will be set to None. If you instead wish to have Tonic Validate throw an exception when there's an error scoring, then set fail_on_error to True in the constructor

scorer = ValidateScorer(fail_on_error=True)

Important: Using the scorer on Azure

If you are using Azure, you MUST set the model_evaluator argument to your deployment name like so

scorer = ValidateScorer(model_evaluator="your-deployment-name")

Running the Scorer

After you instantiate the ValidateScorer with your desired metrics, you can then score the metrics using the callback you defined earlier.

from tonic_validate import ValidateScorer, ValidateApi

# Function to simulate getting a response and context from your LLM
# Replace this with your actual function call
def get_rag_response(question):
    return {
        "llm_answer": "Paris",
        "llm_context_list": ["Paris is the capital of France."]
    }

# Score the responses
scorer = ValidateScorer()
run = scorer.score(benchmark, get_rag_response)

Running the Scorer with manual logging

If you don't want to use the callback, you can instead log your answers manually by iterating over the benchmark and then score the answers.

from tonic_validate import ValidateScorer, LLMResponse

# Save the responses into an array for scoring
responses = []
for item in benchmark:
    llm_response = LLMResponse(
        llm_answer="Paris",
        llm_context_list=["Paris is the capital of France"],
        benchmark_item=item
    )
    responses.append(llm_response)

# Score the responses
scorer = ValidateScorer()
run = scorer.score_responses(responses)

(back to top)

Viewing the Results

There are two ways to view the results of a run.

Option 1: Print Out the Results

You can manually print out the results via Python like so

print("Overall Scores")
print(run.overall_scores)
print("------")
for item in run.run_data:
    print("Question: ", item.reference_question)
    print("Answer: ", item.reference_answer)
    print("LLM Answer: ", item.llm_answer)
    print("LLM Context: ", item.llm_context)
    print("Scores: ", item.scores)
    print("------")

which outputs the following

Overall Scores
{'answer_consistency': 1.0, 'augmentation_accuracy': 1.0}
------
Question:  What is the capital of France?
Answer:  Paris
LLM Answer:  Paris
LLM Context:  ['Paris is the capital of France.']
Scores:  {'answer_consistency': 1.0, 'augmentation_accuracy': 1.0}
------
Question:  What is the capital of Spain?
Answer:  Madrid
LLM Answer:  Paris
LLM Context:  ['Paris is the capital of France.']
Scores:  {'answer_consistency': 1.0, 'augmentation_accuracy': 1.0}
------

Option 2: Use the Tonic Validate UI (Recommended, Free to Use)

You can easily view your run results by uploading them to our free-to-use UI. The main advantage of this method is that the Tonic Validate UI provides graphing for your results along with additional visualization features. To sign up for the UI, go here.

Once you sign up for the UI, you will go through an onboarding to create an API Key and Project.

Copy both the API Key and Project ID from the onboarding and insert them into the following code

from tonic_validate import ValidateApi
validate_api = ValidateApi("your-api-key")
validate_api.upload_run("your-project-id", run)

This will upload your run to the Tonic Validate UI where you can view the results. On the home page, you can view the change in scores across runs over time.

You can also view the results of an individual run in the UI.

(back to top)

Telemetry

Tonic Validate collects minimal telemetry to help us figure out what users want and how they're using the product. We do not use any existing telemetry framework and instead created our own privacy-focused setup. Only the following information is tracked:

  • What metrics were used for a run
  • Number of questions in a run
  • Time taken for a run to be evaluated
  • Number of questions in a benchmark
  • SDK Version being used

We do NOT track things such as the contents of the questions / answers, your scores, or any other sensitive information. For detecting CI/CD, we only check for common environment variables in different CI/CD environments. We do not log the values of these environment variables.

We also generate a random UUID to help us figure out how many users are using the product. This UUID is linked to your Validate account only to help us see who is using both the SDK and the UI and to get user counts. If you want to see how we implemented telemetry, you can do so in the tonic_validate/utils/telemetry.py file.

If you wish to opt out of telemetry, you only need to set the TONIC_VALIDATE_DO_NOT_TRACK environment variable to True.
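
For example, one way to opt out from Python (setting the variable in your shell before running works as well); this is only a minimal illustration.

import os

# Opt out of Tonic Validate telemetry
os.environ["TONIC_VALIDATE_DO_NOT_TRACK"] = "True"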

(back to top)

FAQ

What models can I use as an LLM evaluator?

We currently support the families of chat completion models from OpenAI, Google, Anthropic, and more. We are always looking to add more models to our evaluator. If you have a model you would like to see added, please file an issue against this repository.

We'd like to add more models as choices for the LLM evaluator without adding too much complexity to the package.

The default model used for scoring metrics is GPT-4 Turbo. To change the model, pass the model name into the model_evaluator argument for ValidateScorer

from tonic_validate import ValidateScorer
from tonic_validate.metrics import AnswerConsistencyMetric, AugmentationAccuracyMetric

scorer = ValidateScorer([
    AnswerConsistencyMetric(),
    AugmentationAccuracyMetric()
], model_evaluator="gpt-3.5-turbo")

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

LinkedIn Email Schedule a meeting

(back to top)


tonic_validate's Issues

When the OpenAI evaluator errors out, the integrity of the entire benchmark results comes into question

for metric in self.metrics:

So one problem we are seeing is that when the OpenAI evaluator call errors out (and returns 0, say), the full benchmark results come into question.

For example, if we run the full PaulGrahamEssayDataset dataset with 44 questions, we will notice that some of the metrics are straight-out 0 -- because the OpenAI evaluator errored out (for whatever reason). This puts the final overall scores into question.

To get around this, we had to NOT send all the 44 responses in one call to scorer.score_run -- but rather do it one at a time and check to make sure the evaluator has not errored out. However, this approach is not ideal.

What would be nice is if the error checking for the evaluator (say, via a retry mechanism) happened in tonic_validate/utils/llm_calls.py, so that resiliency is captured at that level.

The reason this is critical is because there is a big difference between a score of "0" and "error" when it comes to the final benchmarking results.
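
As an illustration of the one-at-a-time workaround described above (not an official recommendation), responses can be scored individually with score_responses, and any item whose metric comes back as 0 or None can be flagged for re-scoring; the flagging logic here is purely a placeholder.

from tonic_validate import ValidateScorer

scorer = ValidateScorer()
suspect_responses = []
for response in responses:  # responses built as shown earlier in this README
    run = scorer.score_responses([response])
    scores = run.run_data[0].scores
    # Treat a 0 or missing score as a possible evaluator error rather than a real result
    if any(score is None or score == 0 for score in scores.values()):
        suspect_responses.append(response)

print(f"{len(suspect_responses)} responses may need to be re-scored")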

Other Metrics

Adam suggested I open an issue to brainstorm other possible metrics .. given the upcoming focus around AI safety, here are some more I was thinking about:

  1. Bias : Use a single evaluator call to check Bias against age, race, gender, sexual orientation, culture and other DEI factors.
    a. I would suggest scoring each factor on a score of 1-5 (rather than 0 to 5). You can also ask the evaluator to explain the scoring in the prompt (debugging purposes)

  2. Legal and Ethical Compliance Checks: Ensure that all responses adhere to legal and ethical standards, particularly regarding privacy, confidentiality, and regulatory compliance.
    a. Having a couple of metrics like PII and PHI scoring for these factors.

  3. Other possible metrics:

    1. Jailbreaking: jailbreak attempts, prompt injections, and LLM refusals of service
    2. Toxicity
    3. Hate Speech, Harassment, Sexually Explicit, Dangerous Content (see Google's list below)
    4. PII leakage

REFERENCES

  1. https://glassboxmedicine.com/2023/11/28/bias-toxicity-and-jailbreaking-large-language-models-llms/
  2. https://cloud.google.com/vertex-ai/docs/generative-ai/multimodal/configure-safety-attributes
  3. https://docs.whylabs.ai/docs/langkit-features/
  4. https://superwise.ai/llm-monitoring/

ModuleNotFoundError: No module named 'async_lru'

After upgrading to v4.0.1, I run into this error when I try to use the evals:

(.venv) C:\Users\TurnerZ\Documents\GitHub\amaliai-hr>python C:\Users\TurnerZ\Documents\GitHub\amaliai-hr\app\src\eval_perf.py
Traceback (most recent call last):
  File "C:\Users\TurnerZ\Documents\GitHub\amaliai-hr\app\src\eval_perf.py", line 6, in <module>
    from tonic_validate import ValidateScorer, Benchmark, LLMResponse, ValidateApi
  File "C:\Users\TurnerZ\Documents\GitHub\amaliai-hr\.venv\Lib\site-packages\tonic_validate\__init__.py", line 2, in <module>
    from .validate_scorer import ValidateScorer
  File "C:\Users\TurnerZ\Documents\GitHub\amaliai-hr\.venv\Lib\site-packages\tonic_validate\validate_scorer.py", line 12, in <module>
    from tonic_validate.metrics.answer_consistency_metric import AnswerConsistencyMetric
  File "C:\Users\TurnerZ\Documents\GitHub\amaliai-hr\.venv\Lib\site-packages\tonic_validate\metrics\__init__.py", line 1, in <module>
    from .answer_consistency_binary_metric import AnswerConsistencyBinaryMetric
  File "C:\Users\TurnerZ\Documents\GitHub\amaliai-hr\.venv\Lib\site-packages\tonic_validate\metrics\answer_consistency_binary_metric.py", line 3, in <module>
    from tonic_validate.metrics.binary_metric import BinaryMetric
  File "C:\Users\TurnerZ\Documents\GitHub\amaliai-hr\.venv\Lib\site-packages\tonic_validate\metrics\binary_metric.py", line 5, in <module>
    from tonic_validate.metrics.metric import Metric
  File "C:\Users\TurnerZ\Documents\GitHub\amaliai-hr\.venv\Lib\site-packages\tonic_validate\metrics\metric.py", line 4, in <module>
    from tonic_validate.services.openai_service import OpenAIService
  File "C:\Users\TurnerZ\Documents\GitHub\amaliai-hr\.venv\Lib\site-packages\tonic_validate\services\openai_service.py", line 5, in <module>
    from async_lru import alru_cache
ModuleNotFoundError: No module named 'async_lru'

tonic-validate 4.0.1

Retrieval precision

I want to get the retrieval precision score.
I have tried many times, but the results always show only answer_similarity, augmentation_precision, and answer_consistency; no other metrics are available:
RunData(scores={'answer_similarity': None, 'augmentation_precision': 1.0, 'answer_consistency': 0.0}
So I want to know whether I need to pass in any additional data. Can you give me an example? Thank you very much.

Support fine-grained metrics

Hello, could you please add support for more fine-grained metrics, possibly user-definable, like time to get embeddings, time to retrieve chunks from a vector database, time to get first bytes from streaming LLM etc.? That would give us a much better view of important metrics and possible regressions.

Support history display in tonic_validate UI

When we evaluate the performance of an agent in llama-index, we include a chat history in each test case. If those chat histories could be displayed in the tonic_validate UI, it would be very helpful.

Fix imports to include new classes

We have added several new classes like CallbackLLMResponse which aren't exported by our __init__.py files. We should fix this to export them to make it easier to import these classes.
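
A minimal sketch of what that export could look like (assuming CallbackLLMResponse lives in the classes subpackage; the exact module path may differ):

# tonic_validate/__init__.py (sketch)
from .classes import CallbackLLMResponse

__all__ = [
    # ...existing exports...
    "CallbackLLMResponse",
]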

Validate Score Values

Add validation to the metric scores to ensure that they fall in the range that they say they fall in to.

We do have validation to see if the scores returned from the LLMs can be parsed into floats. If the scores cannot be parsed into floats, then an error message is logged and a default score of 0.0 is returned. This issue asks to add additional validation so that, in addition to being parsable as floats, the scores also fall into the required range.
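
A minimal sketch of the kind of check being requested (the function name and error-handling behaviour here are illustrative, not the library's actual implementation):

def parse_and_validate_score(raw: str, min_score: float, max_score: float) -> float:
    """Parse an LLM-returned score and verify it falls within the metric's declared range."""
    try:
        score = float(raw)
    except ValueError:
        raise ValueError(f"Could not parse score from LLM response: {raw!r}")
    if not (min_score <= score <= max_score):
        raise ValueError(f"Score {score} is outside the expected range [{min_score}, {max_score}]")
    return score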

Expose LLM Evaluator Prompts

Currently the LLM evaluator prompts for a given metric are not exposed by the class that calculates the metric. Make it so that these prompts are explicitly accessible from the metric class.

This change will improve the tonic_validate integration in llama_index, as the evaluators in llama_index have methods that return these prompts. Once this change is made, a subsequent PR to llama_index should be made to add this functionality into the tonic_validate integration there.

Reference: run-llama/llama_index#10000 (comment)

ModuleNotFoundError

I run into this error when I try to set up validate

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 6
      4 from os import environ
      5 from rag import get_rag_response, load_cloud_qdrant_index, create_cloud_qdrant_index
----> 6 from tonic_validate import ValidateScorer, Benchmark, LLMResponse, ValidateApi
      7 from tonic_validate.metrics import (
      8     AnswerSimilarityMetric,
      9     RetrievalPrecisionMetric,
   (...)
     13     LatencyMetric
     14 )
     15 from llama_index.postprocessor.cohere_rerank import CohereRerank

File c:\Users\TurnerZ\Documents\GitHub\amaliai-hr\.venv\Lib\site-packages\tonic_validate\__init__.py:2
      1 from .validate_api import ValidateApi
----> 2 from .validate_scorer import ValidateScorer
      3 from .classes import (
      4     Benchmark,
      5     BenchmarkItem,
   (...)
     11     UserInfo,
     12 )
     14 __all__ = [
     15     "ValidateApi",
...
     14 from openai.types.beta.thread_create_params import (
     15     Message as OpenAICreateThreadParamsMessage,
     16 )

ModuleNotFoundError: No module named 'openai.types.beta.threads.message_content'

tonic-validate version: 5.0.0

Config runs on import

Right now, the env var setup in config runs on import rather than when the classes using the config are instantiated. E.g., when doing from tonic_validate import Benchmark, ValidateApi, ValidateScorer, config.py will run and set up the env vars for Validate prematurely.
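
One common way to avoid this (shown here as a rough sketch, not Validate's actual config module; the variable names are placeholders) is to read environment variables lazily inside the class rather than at module import time:

import os

class Config:
    """Reads configuration lazily, so importing the module has no side effects."""

    def __init__(self):
        # The environment is only consulted when a Config instance is created
        self.api_key = os.environ.get("TONIC_VALIDATE_API_KEY")
        self.base_url = os.environ.get("TONIC_VALIDATE_BASE_URL", "https://validate.tonic.ai/api/v1")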

Have only one paul_graham_essays folder

There's currently a paul_graham_essays folder in the examples folder that contains 6 essays used for the quickstart example notebook and another paul_graham_essays folder in the titan_vs_cohere folder that contains all the Paul Graham essays (many more than 6), which are used in our head to head RAG evaluation blog posts. Having two folders with the same name and different content creates confusion. They should be consolidated into one folder.

Anti-Hallucination Score - Switched?

https://github.com/TonicAI/tonic_validate/blob/058ed170d6f3666f15f398a5349732c9f7a80a2f/tonic_validate/utils/llm_calls.py#L90C5-L97C6

Guys -- is it possible that this code has switched the metric value?

So: If the answer contains information that cannot be attributed to the context then respond with true.

This means: When there IS hallucination, the score is "1".

While the metric signifies "anti-hallucination" -- which means high scores are good, low scores are bad.

https://docs.tonic.ai/validate/about-rag-metrics/tonic-validate-rag-metrics-reference

Answer consistency binary
from tonic_validate.metrics import AnswerConsistencyBinaryMetric
Answer consistency binary indicates whether all of the information in the answer is derived from the retrieved context.
If all of the information in the answer comes from the retrieved context, then the answer is consistent with the context, and the value of this metric is 1.
If the answer contains information that is not derived from the context, then the value of this metric is 0.
To calculate answer consistency binary, we ask an LLM whether the RAG system answer contains any information that is not derived from the context.
Answer consistency binary is a binary integer.

https://github.com/TonicAI/tonic_validate/blob/main/tonic_validate/metrics/answer_consistency_binary_metric.py#L11C7-L11C36

Returns
        -------
        int
            0 if there is information in answer not derived from context in context_list
            1 otherwise

https://github.com/TonicAI/tonic_validate/blob/main/tonic_validate/utils/llm_calls.py#L90C5-L97C6

main_message = (
        "Considering the following list of context and then answer, which answers a"
        "user's query using the context, determine whether the answer contains any"
        "information that can not be attributed to the intormation in the list of"
        "context. If the answer contains information that cannot be attributed to the"
        "context then respond with true. Otherwise response with false. Response with"
        "either true or false and no additional text."
    )

Detect Validate GH Actions

We need to check whether the user is using one of the default GitHub Actions that we provide for Tonic Validate, to help us gauge their usage.

Add Telemetry

We are adding in very basic telemetry to help us get an idea of what users want in the product. For privacy reasons, we are rolling our own telemetry solution instead of using existing solutions. Only the following information will be logged by the telemetry

  • What metrics were used for a run
  • Number of questions in a run
  • Number of questions in a benchmark

We will NOT track things such as the contents of the questions / answers, scores, or any other sensitive information. We will only track the list of metrics and the number of questions/benchmarks.

Python 3.8 incompatible typing

Tonic Validate seems to support Python 3.8 judging by the pyproject.toml, but there is some incompatible typing that breaks Python 3.8 or lower:

/home/runner/.cache/pants/named_caches/pex_root/venvs/s/5ac6c5cf/venv/lib/python3.8/site-packages/tonic_validate/classes/llm_response.py:10: in LLMResponse
    llm_context_list: list[str]
E   TypeError: 'type' object is not subscriptable

Need to type with
from typing import List
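
A sketch of the fix on Python 3.8, which does not support subscripting the built-in list type in annotations:

from typing import List

# Works on Python 3.8, unlike `llm_context_list: list[str]`
llm_context_list: List[str] = ["Paris is the capital of France."]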

500 error when passing data to the dashboard using validate_api

Hey - noticed a problem when trying to log the run results using validate_api
`/usr/local/lib/python3.10/dist-packages/tonic_validate/validate_api.py in upload_run(self, project_id, run, run_metadata)
57 )
58 for run_data in run.run_data:
---> 59 _ = self.client.http_post(
60 f"/projects/{project_id}/runs/{run_response['id']}/logs",
61 data=run_data.to_dict(),

/usr/local/lib/python3.10/dist-packages/tonic_validate/utils/http_client.py in http_post(self, url, params, data)
61 verify=False,
62 )
---> 63 res.raise_for_status()
64 return res.json()
65

/usr/local/lib/python3.10/dist-packages/requests/models.py in raise_for_status(self)
1019
1020 if http_error_msg:
-> 1021 raise HTTPError(http_error_msg, response=self)
1022
1023 def close(self):

HTTPError: 500 Server Error: Internal Server Error for url: https://validate.tonic.ai/api/v1/projects/projId/runs/runId/logs`

Here's our input:
scorer = ValidateScorer([AnswerConsistencyMetric(), AnswerConsistencyBinaryMetric(), AugmentationPrecisionMetric(), AugmentationAccuracyMetric(), RetrievalPrecisionMetric()])
run = scorer.score(benchmark, get_llm_response)

Request for OpenAI Assistant Examples

This is not an issue but a request for further examples associated with testing OpenAI Assistants.

Currently, there is one example which compares a CustomGPT vs an Assistant, but I was looking for more if possible. The example shows the use of AnswerSimilarityMetric, and I'm curious if there are any more examples you can share on validating an Assistant's performance using other metrics.

Thanks!

Make PyPi README match Github

Right now, the PyPI README is just a link to GitHub, which isn't very helpful. We should merge the two READMEs to make it easier to see the documentation.

Detect CI/CD

We should add CI/CD detection to help us distinguish between CI runs and local runs. This will allow us to filter through CI usage to get a better idea of how many people are using Tonic Validate.
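
A rough sketch of that kind of detection (the exact set of environment variables checked is illustrative):

import os

def is_ci() -> bool:
    """Best-effort check for common CI/CD environments based on well-known env vars."""
    ci_env_vars = ["CI", "GITHUB_ACTIONS", "GITLAB_CI", "CIRCLECI", "JENKINS_URL"]
    return any(os.environ.get(var) for var in ci_env_vars)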

Add multiple runs per question and report average/stdev

Hello, please add the ability to have a fixed number of runs per question instead of 1 and report average and stdev of all metrics (perhaps min/max or some sort of a histogram as well). That would allow avoiding outliers in the testing process like network connection issues, LLM temperature effect etc.
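
Until such a feature exists, a workaround along these lines is possible (a sketch only; n_runs and the callback are placeholders, and benchmark is defined as earlier in this README):

import statistics
from collections import defaultdict
from tonic_validate import ValidateScorer

n_runs = 5  # Placeholder: number of repeated scoring passes per benchmark
scorer = ValidateScorer()
per_metric_scores = defaultdict(list)

for _ in range(n_runs):
    run = scorer.score(benchmark, get_llm_response)
    for metric, score in run.overall_scores.items():
        if score is not None:
            per_metric_scores[metric].append(score)

for metric, scores in per_metric_scores.items():
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    print(f"{metric}: mean={mean:.3f} stdev={stdev:.3f}")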

Metric classes for metrics that do and do not use LLM assisted evaluation

#69 introduced a bunch of new metrics, some of which use LLM-assisted evaluation and some of which do not. This issue is to create subclasses of the Metric class for metrics that use LLM-assisted evaluation and metrics that do not, so that only the metrics that use LLM-assisted evaluation need an LLM service.
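
A rough sketch of the class split being proposed (class and method names here are hypothetical, not the library's actual API):

from abc import ABC, abstractmethod

class Metric(ABC):
    """Hypothetical common base class for all metrics."""
    name: str

class LLMAssistedMetric(Metric):
    """Metrics that need an LLM evaluator, e.g. answer consistency."""

    @abstractmethod
    def score(self, llm_response, llm_service) -> float:
        ...

class NonLLMMetric(Metric):
    """Metrics computed without an LLM, e.g. latency or contains text."""

    @abstractmethod
    def score(self, llm_response) -> float:
        ...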
