
ares's Introduction

ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

Table of Contents: Installation | Requirements | Quick Start | Citation


ARES is a groundbreaking framework for evaluating Retrieval-Augmented Generation (RAG) models. The automated process combines synthetic data generation with fine-tuned classifiers to efficiently assess context relevance, answer faithfulness, and answer relevance, minimizing the need for extensive human annotations. ARES employs synthetic query generation and Prediction-Powered Inference (PPI), providing accurate evaluations with statistical confidence.

💬 Mini Q&A


What does ARES assess in RAG models?

ARES conducts a comprehensive evaluation of Retrieval-Augmented Generation (RAG) models, assessing the systems for context relevance, answer faithfulness, and answer relevance. This thorough assessment ensures a complete understanding of the performance of the RAG system.

How does ARES automate the evaluation process?

ARES minimizes the need for human labeling by leveraging fine-tuned classifiers and synthetically generated data. Its Prediction-Powered Inference (PPI) component accounts for variability in model responses and provides statistical confidence in the results, so ARES can deliver accurate assessments with only a small set of human annotations.

Can ARES handle my custom RAG model?

Yes, ARES is a model-agnostic tool that enables you to generate synthetic queries and answers from your documents. With ARES, you can then evaluate the queries and answers produced by your RAG model.

โš™๏ธ Installation


To install ARES, run the following command:
pip install ares-ai

Optional: Initialize your OpenAI or TogetherAI API key with the following commands:

export OPENAI_API_KEY=<your key here>
export TOGETHER_API_KEY=<your key here>
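
Alternatively, you can set the key from Python (for example, in a notebook) before importing ARES. This is only a sketch, assuming you are using the OpenAI key; swap in TOGETHER_API_KEY as needed.

import os

# Set the key for the current process before importing ares.
# Replace the placeholder with your actual key.
os.environ["OPENAI_API_KEY"] = "<your key here>"

from ares import ARES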

๐Ÿ“ Requirements


To implement ARES for scoring your RAG system and comparing it to other RAG configurations, you need three components:

  • A human preference validation set of annotated query, document, and answer triples for the evaluation criteria (e.g. context relevance, answer faithfulness, and/or answer relevance). There should be at least 50 examples, but several hundred examples are ideal (a minimal format check is sketched after this list).
  • A set of few-shot examples for scoring context relevance, answer faithfulness, and/or answer relevance in your system
  • A much larger set of unlabeled query-document-answer triples outputted by your RAG system for scoring
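
Before running ARES, it can help to sanity-check that your files follow the same layout as the example NQ files (Query, Document, and Answer columns plus whichever label columns you annotated). The snippet below is only a sketch; the file name is a placeholder for your own validation set.

import pandas as pd

# Placeholder file name; point this at your own annotated validation set.
labeled = pd.read_csv("my_labeled_validation_set.tsv", sep="\t")

required = ["Query", "Document", "Answer", "Context_Relevance_Label"]
missing = [col for col in required if col not in labeled.columns]
assert not missing, f"Labeled set is missing columns: {missing}"
assert len(labeled) >= 50, "Aim for at least 50 annotated examples"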


To get started with ARES, you'll need to set up your configuration. Below is an example of a configuration for ARES!

Copy-paste each step to see ARES in action!


📥 Download datasets


Use the following commands to quickly obtain the necessary files for getting started. These include the few-shot prompt files for judge scoring and synthetic query generation, as well as both the labeled and unlabeled datasets.

wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_few_shot_prompt_for_judge_scoring.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_few_shot_prompt_for_synthetic_query_generation.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_labeled_output.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/main/datasets/example_files/nq_unlabeled_output.tsv

OPTIONAL: You can run the following command to get the full NQ dataset! (347 MB)

from ares import ARES
ares = ARES() 
ares.KILT_dataset("nq")

# Fetches NQ datasets with ratios including 0.5, 0.6, 0.7, etc.
# For purposes of our quick start guide, we rename nq_ratio_0.5 to nq_unlabeled_output and nq_labeled_output.
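
If you use the full KILT download, the renaming mentioned in the comment above can be done in a couple of lines. The source file name is an assumption based on the nq_ratio_0.5 split; adjust it to whatever KILT_dataset actually writes on your machine.

import shutil

# Assumed output name of the 0.5 ratio split; adjust to the actual path.
shutil.copy("nq_ratio_0.5.tsv", "nq_unlabeled_output.tsv")
shutil.copy("nq_ratio_0.5.tsv", "nq_labeled_output.tsv")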

🚀 Quick Start - #1


To get started with ARES's PPI, you'll need to set up your configuration. Below is an example of a configuration for ARES!

Just copy-paste as you go to see ARES in action!

Step 1) Run the following to retrieve the UES/IDP scores with GPT-3.5!

from ares import ARES

ues_idp_config = {
    "in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "unlabeled_evaluation_set": "nq_unlabeled_output.tsv", 
    "model_choice" : "gpt-3.5-turbo-0125"
} 

ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)
# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}
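
Each entry in the returned dictionary is a list of per-example scores (as the comment above indicates), so a quick way to summarize a run is to average them. A minimal sketch, assuming that dictionary shape:

# Average the per-example scores returned by ues_idp()
# (assumes each value in `results` is a list of numeric scores).
for metric, scores in results.items():
    print(f"{metric}: {sum(scores) / len(scores):.3f}")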

Step 2) Run the following to retrieve ARES's PPI scores with GPT-3.5!

ppi_config = { 
    "evaluation_datasets": ['nq_unlabeled_output.tsv'], 
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "llm_judge": "gpt-3.5-turbo-1106",
    "labels": ["Context_Relevance_Label"], 
    "gold_label_path": "nq_labeled_output.tsv", 
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

🚀 Quick Start - #2


Step 1) Run the following to see GPT-3.5's accuracy on the NQ unlabeled dataset!

from ares import ARES

ues_idp_config = {
    "in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "unlabeled_evaluation_set": "nq_unlabeled_output.tsv", 
    "model_choice" : "gpt-3.5-turbo-0125"
} 

ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)
# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}

Step 2) Run the following to see ARES's synthetic generation in action!

from ares import ARES

synth_config = { 
    "document_filepaths": ["nq_labeled_output.tsv"] ,
    "few_shot_prompt_filename": "nq_few_shot_prompt_for_synthetic_query_generation.tsv",
    "synthetic_queries_filenames": ["synthetic_queries_1.tsv"], 
    "documents_sampled": 6189
}

ares_module = ARES(synthetic_query_generator=synth_config)
results = ares_module.generate_synthetic_data()
print(results)
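
Before moving on to classifier training, it is worth peeking at the generated file. The snippet below only prints what is actually there, so no column layout is assumed:

import pandas as pd

# Inspect the synthetic queries produced above before training on them.
synth = pd.read_csv("synthetic_queries_1.tsv", sep="\t")
print(len(synth), "synthetic rows")
print(list(synth.columns))
print(synth.head())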

Step 3) Run the following to see ARES's training classifier in action!

from ares import ARES

classifier_config = {
    "training_dataset": ["synthetic_queries_1.tsv"], 
    "validation_set": ["nq_labeled_output.tsv"], 
    "label_column": ["Context_Relevance_Label"], 
    "num_epochs": 10, 
    "patience_value": 3, 
    "learning_rate": 5e-6,
    "assigned_batch_size": 1,  
    "gradient_accumulation_multiplier": 32,  
}

ares = ARES(classifier_model=classifier_config)
results = ares.train_classifier()
print(results)

Note: This code creates a checkpoint for the trained classifier. Training may take some time. You can download our jointly trained context-relevance checkpoint here: Download Checkpoint


Step 4) Run the following to see ARES's PPI in action!

from ares import ARES

ppi_config = { 
    "evaluation_datasets": ['nq_unlabeled_output.tsv'], 
    "checkpoints": ["Context_Relevance_Label_nq_labeled_output_date_time.pt"], 
    "rag_type": "question_answering", 
    "labels": ["Context_Relevance_Label"], 
    "gold_label_path": "nq_labeled_output.tsv", 
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

# Output Should be: 
""" 
Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6056978059262574]
ARES Confidence Interval: [[0.547, 0.664]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.789]
Annotated Examples used for PPI: 300
"""

🚀 Local Model Execution with vLLM

ARES supports vLLM, allowing LLMs to be executed locally, which offers enhanced privacy and the ability to operate ARES offline. Below are the steps to use vLLM for ARES's UES/IDP and PPI.
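
Before running either configuration, make sure an OpenAI-compatible vLLM server is already listening at the host_url you plan to pass. The check below is only a sketch and assumes the server exposes the standard /v1/models endpoint:

import json
import urllib.request

# Assumed server address; replace with the host_url you pass to ARES.
host_url = "http://0.0.0.0:8000/v1"
with urllib.request.urlopen(f"{host_url}/models", timeout=10) as response:
    print(json.load(response))  # lists the models currently served by vLLM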

1) UES/IDP w/ vLLM

from ares import ARES

ues_idp_config = {
    "in_domain_prompts_dataset": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "unlabeled_evaluation_set": "nq_unlabeled_output.tsv", 
    "model_choice": "meta-llama/Llama-2-13b-hf", # Specify vLLM model
    "vllm": True, # Toggle vLLM to True 
    "host_url": "http://0.0.0.0:8000/v1" # Replace with server hosting model followed by "/v1"
} 

ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)

2) PPI w/ vLLM

from ares import ARES

ppi_config = { 
    "evaluation_datasets": ['nq_unabeled_output.tsv'], 
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "llm_judge": "meta-llama/Llama-2-13b-hf", # Specify vLLM model
    "labels": ["Context_Relevance_Label"], 
    "gold_label_path": "nq_labeled_output.tsv",
    "vllm": True, # Toggle vLLM to True 
    "host_url": "http://0.0.0.0:8000/v1" # Replace with server hosting model followed by "/v1"
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

For more details, refer to our documentation.


Results Replication

We include synthetic datasets for key experimental results in synthetic_datasets. The few-shot prompts used for generation and evaluation are included in datasets. We also include instructions for fine-tuning LLM judges in the paper itself. Please reach out to [email protected] or [email protected] if you have any further questions.

Citation

To cite our work, please use the following BibTeX:

@misc{saadfalcon2023ares,
      title={ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems}, 
      author={Jon Saad-Falcon and Omar Khattab and Christopher Potts and Matei Zaharia},
      year={2023},
      eprint={2311.09476},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Appendix

Machine requirements and setup when not using OpenAI API

Machine requirements

  • Over ~100 GB of available disk space
  • GPU
    • Should work: A100 (e.g. Standard_NC24ads_A100_v4 on Azure)
    • Does not work:
      • Tested on 2023-12-17 with both Standard_NC6s_v3 and Standard_NC12s_v3, and ran into this error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 15.77 GiB total capacity; 15.12 GiB already allocated; 95.44 MiB free; 15.12 GiB reserved in total by PyTorch)

Machine setup

For example, on an Azure VM running Linux (Ubuntu 20.04), you will need to do the following:

  • Install conda
    • First set of commands (can copy-paste multiple lines)
      • wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      • chmod +x Miniconda3-latest-Linux-x86_64.sh
      • ./Miniconda3-latest-Linux-x86_64.sh -b
    • Second set of commands (can copy-paste multiple lines)
      • export PATH="$HOME/miniconda3/bin:$PATH"
      • conda init
  • Install gcc
    • sudo apt-get -y update
    • sudo apt-get -y upgrade
    • sudo apt-get -y install build-essential
    • sudo apt-get -y install libpcre3-dev
  • Install NVIDIA drivers
    • sudo apt install ubuntu-drivers-common -y
    • sudo ubuntu-drivers autoinstall
    • sudo reboot
    • SSH in again and confirm the installation was successful by running nvidia-smi
  • cd to ARES folder and follow the rest of the README

ares's People

Contributors

alexisdeschamps, dependabot[bot], elsatch, jonsaadfalcon, robbym-dev, tm17-abcgen, wj44


ares's Issues

Evaluating more than one dataset at a time returns incorrect results

Over the last few weeks I have been testing the evaluation features of ARES without achieving the expected results. The errors I've found are related to #44, which was marked as closed but never solved.

Given that the current state of the code (ares-ai PyPI library 0.6.1) makes it impossible to get a proper ARES Ranking for different datasets in the final results, I decided to explore further.

Baseline

To establish an initial baseline, I executed the reference code from the Quick Start Guide 2. This is the relevant code:

from ares import ARES

ppi_config = { 
    "evaluation_datasets": ['nq_unlabeled_output.tsv'], 
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "checkpoints": ["checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt"], 
    "rag_type": "question_answering", 
    "labels": ["Context_Relevance_Label"], 
    "gold_label_path": "nq_labeled_output.tsv", 
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

The NQ datasets were downloaded using the wget commands from the setup part of the guide. The checkpoint wasn't trained but downloaded from the provided drive link.

These are the results:

Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6056978059262574]
ARES Confidence Interval: [[0.547, 0.664]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.789]
Annotated Examples used for PPI: 300

Test - Evaluating more than one dataset at a time

To test this example, we will download two different datasets from the NQ dataset, available from the repository at datasets/eval_datasets/nq, using the following commands:

wget https://github.com/stanford-futuredata/ARES/raw/main/datasets/eval_datasets/nq/nq_ratio_0.65.tsv
wget https://github.com/stanford-futuredata/ARES/raw/main/datasets/eval_datasets/nq/nq_ratio_0.7.tsv

This is the resulting code:

from ares import ARES

ppi_config = { 
    "evaluation_datasets": ['nq_ratio_0.65.tsv', 'nq_ratio_0.7.tsv'], 
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "checkpoints": ["checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt"], 
    "rag_type": "question_answering", 
    "labels": ["Context_Relevance_Label"], 
    "gold_label_path": "nq_labeled_output.tsv", 
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

And these are the results:

--------------------------------------------------------
Evaluation Sets: ['nq_ratio_0.65.tsv', 'nq_ratio_0.7.tsv']
Checkpoints: ['checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt']
Labels: ['Context_Relevance_Label']
--------------------------------------------------------
[...]
--------------------------------------------------
Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6354300416564624]
ARES Confidence Interval: [[0.577, 0.694]]
Number of Examples in Evaluation Set: [4081]
Ground Truth Performance: [0.65]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.792]
Annotated Examples used for PPI: 300
--------------------------------------------------
[...]
--------------------------------------------------
Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6354300416564624, 0.6638786279683391]
ARES Confidence Interval: [[0.577, 0.694], [0.605, 0.722]]
Number of Examples in Evaluation Set: [4081, 3790]
Ground Truth Performance: [0.65, 0.7]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.792, 0.798]
Annotated Examples used for PPI: 300
--------------------------------------------------
# Reformatted to make clear that the results are duplicated
[{'ARES_Prediction': 0.6354300416564624, 'ARES_Confidence_Interval': [0.577, 0.694], 'Number_of_Examples_in_Evaluation_Set': 4081, 'Ground_Truth_Performance': 0.65, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': 0.792, 'Annotated_Examples_used_for_PPI': 300},
 {'ARES_Prediction': 0.6354300416564624, 'ARES_Confidence_Interval': [0.577, 0.694], 'Number_of_Examples_in_Evaluation_Set': 4081, 'Ground_Truth_Performance': 0.65, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': 0.792, 'Annotated_Examples_used_for_PPI': 300}]

The evaluation first returns the results for the first dataset, then appends the results for the second dataset. In the final recap, however, it duplicates the first dataset's scores into the second entry, returning incorrect results.

This problem compounds when analyzing several datasets and several labels: the evaluation keeps producing incorrect results, overwriting the results of the second dataset with the results of the first dataset for the same label.

[
    # First label - First dataset
    {
        "ARES_Prediction": 0.6354300416564624,
        "ARES_Confidence_Interval": [0.577, 0.694],
        "Number_of_Examples_in_Evaluation_Set": 4081,
        "Ground_Truth_Performance": 0.65,
        "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.792,
        "Annotated_Examples_used_for_PPI": 300,
    },
    # First label - should be second dataset. Duplicated
    {
        "ARES_Prediction": 0.6354300416564624,
        "ARES_Confidence_Interval": [0.577, 0.694],
        "Number_of_Examples_in_Evaluation_Set": 4081,
        "Ground_Truth_Performance": 0.65,
        "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.792,
        "Annotated_Examples_used_for_PPI": 300,
    },
    # Second label - First dataset
    {
        "ARES_Prediction": 0.5664216286857816,
        "ARES_Confidence_Interval": [0.51, 0.622],
        "Number_of_Examples_in_Evaluation_Set": 4081,
        "Ground_Truth_Performance": 0.65,
        "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.65,
        "Annotated_Examples_used_for_PPI": 300,
    },
    # Second label - should be second dataset. Duplicated
    {
        "ARES_Prediction": 0.5664216286857816,
        "ARES_Confidence_Interval": [0.51, 0.622],
        "Number_of_Examples_in_Evaluation_Set": 4081,
        "Ground_Truth_Performance": 0.65,
        "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.65,
        "Annotated_Examples_used_for_PPI": 300,
    },
]

Missing protobuf dependencies in ARES 0.6.1 PyPi package

I have created a new Python environment and installed the 0.6.1 version of the Python library using PyPi.

When running the sample code for ares.evaluate_RAG() for the first time I got an import error:

ImportError: 
DebertaV2Converter requires the protobuf library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/protocolbuffers/protobuf/tree/master/python#installation and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.

I installed the protobuf library using pip install protobuf and that solved the problem. So protobuf should be added to the requirements of the ares-ai library.

Documentation and code are so broken!

Hi,

I have tried to reproduce the paper, or more specifically, follow the step by step instructions and unfortunately, nothing works.

As for the things I've detected so far in the python script version:

1.- The current requirements.txt can't be installed as instructed by the README.md file because of conflicting library versions.
2.- The sample document_filepath.tsv file in example_files has 6 examples and the column "Documents".
3.- The synthetic generation example code fails because the number of documents sampled is less than the given number --documents_sampled 10000.
4.- If you change the number of documents_sampled to 5, so it doesn't fail, it will fail later because the step that generates the negative alternatives requires at least 100 samples.

So with the given documents in the example_files folder, it's impossible to generate a synthetic dataset.

Following the new Vercel documentation at https://ares-ai.vercel.app/synth_gen/ is an absolute hit and miss because of the copy-pasted regions. For example, on this page https://ares-ai.vercel.app/synth_gen/

  • The document paths alternate between data and /data, output and /output making the sample code fail
  • Sample dataset name is not correct. In your repo you have nq_ratio_0.6_.tsv and nq_ratio_0.5.tsv, but documentation uses nq_ratio_0.5_.tsv
  • Both the nq_ratio_0.5_.tsv and the nq_ratio_0.6_.tsv have less than 10000 documents, so the example command fails.
  • In the model choice, this section is copied right from the training classifier section and offers incorrect information.

But to make things even worse, the Python code in the ares-ai library is different from the Python scripts, so if you try to run the code using example_files/document_filepath.tsv this will fail too! In the original file, you only needed to pass a "Document" column for ARES to generate the synthetic dataset, but now Query and Answer columns are also required. Otherwise you get the following error:

Error: The DataFrame is missing the following required column(s): Query, Answer.

So it seems the requirements for ARES are quite a bit more complex than expected. The README file contains the following information:

"The ARES training pipeline is three steps:โ€‹

Generate synthetic queries and answers from in-domain passages"

Then:

"A human preference validation set of annotated query, document, and answer triples for the evaluation criteria (e.g. context relevance, answer faithfulness, and/or answer relevance). There should be at least 50 examples but several hundred examples is ideal."

But to generate the synthetic dataset, it requires query, document, and answer triples instead of an in-domain passages file as described.

There are tons of other inconsistencies, but given your code and documentation it's impossible to reproduce even the most basic examples.

[Feature Request] Multilingual support

Hi, thank you for sharing this wonderful project. I'd like to evaluate my Korean RAG applications, but there is no RAG evaluation framework that supports multiple languages. Could you add this feature for non-English developers? If you tell me how to add code to support multilingual evaluation, I'll contribute to it.

Thanks.

RAGAS score calculation from annotations is unclear

I'm not sure how the RAGAS score is computed from annotations in RAG_Automatic_Evaluation/RAGAS_Scoring.py:

# Lines 68-72
sampled_y_labels = dataset.sample(n=300, random_state=42)
context_relevance_prediction = sum(dataset["Context_Relevance_Label"].tolist()) / len(sampled_y_labels)
answer_relevance_prediction = sum(dataset["Answer_Relevance_Label"].tolist()) / len(sampled_y_labels)
context_scores.append(context_relevance_prediction)
answer_relevance_scores.append(answer_relevance_prediction)

While I'm not sure what this code is trying to compute, I ran it to sanity check, and I got nan outputs:
(screenshot showing nan outputs omitted)

Any help understanding this issue and pointers to the relevant sections in the paper would be greatly appreciated. Thanks!

Missing packages in linked Colab notebook

I tried running the linked Colab notebook from the label at the top of README.md.

When executing the notebook, it fails because the ARES library is not installed. I tried adding:

!pip install ares-ai

As the first cell, but after executing it, I got the following error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires ipykernel==5.5.6, but you have ipykernel 6.29.4 which is incompatible.
google-colab 1.0.0 requires ipython==7.34.0, but you have ipython 8.25.0 which is incompatible.
Successfully installed Cython-3.0.10 accelerate-0.30.1 aiohttp-3.9.5 aiosignal-1.3.1 anthropic-0.28.0 ares-ai-0.6.0 asttokens-2.4.1 async-timeout-4.0.3 comm-0.2.2 datasets-2.19.1 dill-0.3.8 eval-type-backport-0.2.0 evaluate-0.4.2 executing-2.0.1 faiss-cpu-1.8.0 fastapi-0.110.2 frozenlist-1.4.1 fsspec-2024.3.1 h11-0.14.0 httpcore-1.0.5 httpx-0.27.0 ipykernel-6.29.4 ipython-8.25.0 ipywidgets-8.1.3 jedi-0.19.1 jiter-0.4.1 jupyterlab-widgets-3.0.11 multidict-6.0.5 multiprocess-0.70.16 numexpr-2.10.0 openai-1.14.2 pure-eval-0.2.2 pyarrow-16.1.0 pyarrow-hotfix-0.6 pytorch-ranger-0.1.1 pytz-2023.4 scipy-1.10.1 sentence-transformers-2.7.0 stack-data-0.6.3 starlette-0.37.2 tabulate-0.9.0 together-1.2.0 traitlets-5.14.3 transformers-4.40.1 widgetsnbextension-4.0.11 xxhash-3.4.1 yarl-1.9.4
WARNING: Upgrading ipython, ipykernel, tornado, prompt-toolkit, pyzmq can
cause your runtime to repeatedly crash or behave in unexpected ways and is not
recommended. If your runtime won't connect or execute code, you can reset it
with "Disconnect and delete runtime" from the "Runtime" menu.

WARNING: The following packages were previously imported in this runtime:
  [IPython]
You must restart the runtime in order to use newly installed versions.

Once you restart, it enters an endless loop until you delete the runtime (but then you return to the initial step, where the ARES library is not available in the Colab environment).

--labels <label columns>

What is this labels list and what should it look like? I created a list with string elements [query, ....] but it raises a KeyError. Please add an example of such a list to the documentation.

Clarification Needed on the Specificity of test_dataset

Hello,

I am currently working with the project and have a question regarding the test_dataset used within. Could you please clarify whether the test_dataset needs to be domain-specific, particularly tailored to the RAG domain, or if a generic labeled dataset is suitable for this purpose?

Unable to import without setting OpenAI key

The readme mentions the following:

Optional: Initialize OpenAI or TogetherAI API key with the following command.

However, I am not able to import ARES without setting the OpenAI key; this line

from ares import ARES

gives the following error:

openai.OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable

Can I use ARES without the OpenAI key? The readme claims it can work with custom RAG models.

getting TypeError: 'type' object is not subscriptable when importing the package

Hi, I am getting this error TypeError: 'type' object is not subscriptable
when I import the package

 from ares import ARES

This is the full error:

----> 1 from ares import ARES

File ~/.local/lib/python3.8/site-packages/ares/__init__.py:1
----> 1 from .ares import ARES

File ~/.local/lib/python3.8/site-packages/ares/ares.py:1
----> 1 from .synthetic_generator import synthetic_generator_config
      2 from .binary_classifier import binary_classifer_config
      3 from .rag_scoring import rag_scoring_config

File ~/.local/lib/python3.8/site-packages/ares/synthetic_generator.py:1
----> 1 from .LLM_as_a_Judge_Adaptation.Generate_Synthetic_Queries_and_Answers import (
      2     load_model,
      3     load_documents,
      4     load_few_shot_prompt,
      5     generate_contradictory_answers,
      6     generate_few_shot_prompts,
      7     generate_synthetic_queries,
      8     Generate_Synthetic_Answers
      9 )
     11 def synthetic_generator_config(
     12     document_filepaths: list, 
...
     68         bool: True if the DataFrame contains all required columns, otherwise the program will exit with an error.
     69     """
     70     # Identify any missing columns

TypeError: 'type' object is not subscriptable

Iteration over labels and datasets not working in PPI

For evaluating RAG systems, the PPI config allows specifying multiple datasets and labels. These labels and datasets are iterated over in the rag_scoring_config method; however, there is a return statement inside the loop, so only the first combination is actually evaluated.

Let me know if you could look into this. I could also make a PR to solve this if you let me know what the expected return value should be in this case.
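
For illustration only, the failure mode described here is the classic early-return-inside-a-loop pattern; the sketch below is not the actual ARES source, just the shape of the bug:

# Not ARES code: a sketch of returning inside nested loops, which means
# only the first (dataset, label) combination ever gets evaluated.
def evaluate_all(datasets, labels, score):
    results = []
    for dataset in datasets:
        for label in labels:
            results.append(score(dataset, label))
            return results  # bug: exits after the first combination
    return results

print(evaluate_all(["a.tsv", "b.tsv"], ["x", "y"], lambda d, l: (d, l)))
# [('a.tsv', 'x')]  -- the remaining combinations are silently skipped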

Evaluation process only works with demo datasets, fails with any real dataset (that has only the columns described in the paper)

According to the section 3.3 of ARES paper:

"Ranking RAG Systems with Confidence Intervals

Once we have prepared our LLM judges, we need to use them to score and rank the competing RAG systems. To do this, ARES samples the in-domain query-document-answer triples produced by each RAG approach, and the judges label each triple, predicting their context relevance, answer faithfulness, and answer relevance. By averaging the individual predicted labels for each in-domain triple, we calculate the RAG system performance across each of the three metrics."

So, to evaluate a RAG configuration you should provide in-domain query-document-answer triples. The ARES code in the repo doesn't support that claim and only works with the example datasets provided, which have all kinds of additional columns for benchmarking purposes.

This is a major issue because it makes it impossible to evaluate a real RAG configuration with your own data that uses only the columns indicated in the ARES paper.

Baseline configuration

This is our sample code to evaluate a RAG configuration with the example datasets provided in the repo.

from ares import ARES

ppi_config = { 
    "evaluation_datasets": ['nq_ratio_0.65.tsv', 'nq_ratio_0.7.tsv'], 
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "checkpoints": ["checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt",
                    "checkpoints/ares_answer_relevance_general_checkpoint_V1.1.pt"], 
    "rag_type": "question_answering", 
    "labels": ["Context_Relevance_Label", "Answer_Relevance_Label"], 
    "gold_label_path": "nq_labeled_output.tsv", 
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

This code returns an evaluation of the RAG configuration using the provided datasets and checkpoints.

How to reproduce the issue:

To reproduce the issue, we will remove all columns other than Query, Document, and Answer from the example datasets and try to evaluate a RAG configuration with them.

import pandas as pd

df_065 = pd.read_csv("nq_ratio_0.65.tsv", sep="\t")
df_07 = pd.read_csv("nq_ratio_0.7.tsv", sep="\t")

df_065 = df_065[["Query", "Document", "Answer"]]
df_07 = df_07[["Query", "Document", "Answer"]]
df_065.to_csv("nq_ratio_0.65_querydocanswer.tsv", sep="\t", index=False)
df_07.to_csv("nq_ratio_0.7_querydocanswer.tsv", sep="\t", index=False)

Note the original columns on the nq datasets:

print(df_065.columns)

# Index(['id', 'input', 'meta', 'output', 'wikipedia_id', 'Document',
#       'paragraph_number', 'Answer', 'Query', 'Context_Relevance_Label',
#       'Answer_Faithfulness_Label', 'Answer_Relevance_Label'],
#      dtype='object')

The new datasets only have the query-document-answer columns. Now we will try to evaluate those configurations again.

from ares import ARES

ppi_config = { 
    "evaluation_datasets": ['nq_ratio_0.65_querydocanswer.tsv', 'nq_ratio_0.7_querydocanswer.tsv'], 
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "checkpoints": ["checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt",
                    "checkpoints/ares_answer_relevance_general_checkpoint_V1.1.pt"], 
    "rag_type": "question_answering", 
    "labels": ["Context_Relevance_Label", "Answer_Relevance_Label"], 
    "gold_label_path": "nq_labeled_output.tsv", 
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

This code will raise an error when accessing the second label:

Traceback (most recent call last):                                                                                                                                                                                             
  File "/mnt/data/work/external/pip_ARES_061/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Answer_Relevance_Label'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/data/work/external/pip_ARES_061/test_only_sample_datasets_bug.py", line 14, in <module>
    results = ares.evaluate_RAG()
  File "/mnt/data/work/external/pip_ARES_061/.venv/lib/python3.10/site-packages/ares/ares.py", line 144, in evaluate_RAG
    return rag_scoring_config(**self.ppi_config)
  File "/mnt/data/work/external/pip_ARES_061/.venv/lib/python3.10/site-packages/ares/rag_scoring.py", line 141, in rag_scoring_config
    test_set, Y_labeled_dataset, Y_labeled_dataloader, Y_labeled_predictions, Yhat_unlabeled_dataset, prediction_column = post_process_predictions(post_process_settings)
  File "/mnt/data/work/external/pip_ARES_061/.venv/lib/python3.10/site-packages/ares/RAG_Automatic_Evaluation/LLMJudge_RAG_Compared_Scoring.py", line 1042, in post_process_predictions
    test_set = test_set[test_set[label] != 0]
  File "/mnt/data/work/external/pip_ARES_061/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/mnt/data/work/external/pip_ARES_061/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'Answer_Relevance_Label'

This error comes from post_process_predictions(). Right after the evaluation of the first label ends and post-processing of the predictions starts, the code iterates over the label columns trying to remove invalid records. Because it tries to access a non-existent column, pandas raises an error and aborts the evaluation process.

if label_column in test_set.columns:
    test_set = test_set[test_set[label_column].notna()]

for label in labels:
    if label != label_column:
        test_set = test_set[test_set[label] != 0]

Secondary issue

Given the issue, it could be tempting to just add the missing columns to the datasets without adding any values to them. This results in an error too, in preprocess_data(), as the whole process is so tied to the example datasets!

When the columns are added but left empty, the error "Insufficient Data: Dataset has fewer than 10 rows after filtering!" is raised.

# All records will be dropped here as the column is full of NaNs
if label_column in test_set.columns:
    test_set = test_set[test_set[label_column].notna()]

# Combine query and document (and answer if applicable) into the text column
# [..] if "Context" in label_column:

# Preprocessing will fail given that all rows have been dropped (full of NaNs)

# Check if the dataset has fewer than 10 rows after filtering
if len(test_set) < 10:
    raise ValueError("Insufficient Data: Dataset has fewer than 10 rows after filtering!")

Filling the columns with random data will make the evaluation process run, but this behavior is totally counter to what a robust RAG evaluation framework should be, as these columns might confuse end users, induce fake results, etc.

import pandas as pd
import random

df_065 = pd.read_csv("nq_ratio_0.65_querydocanswer.tsv", sep="\t")
df_07 = pd.read_csv("nq_ratio_0.7_querydocanswer.tsv", sep="\t")

# Fill the Context_Relevance_Label and Answer_Relevance_Label columns with random data
df_065["Context_Relevance_Label"] = [random.randint(0, 1) for _ in range(len(df_065))]
df_065["Answer_Relevance_Label"] = [random.randint(0, 1) for _ in range(len(df_065))]
df_07["Context_Relevance_Label"] = [random.randint(0, 1) for _ in range(len(df_07))]
df_07["Answer_Relevance_Label"] = [random.randint(0, 1) for _ in range(len(df_07))]
df_065.to_csv("nq_ratio_0.65_random_label_values.tsv", sep="\t", index=False)
df_07.to_csv("nq_ratio_0.7_random_label_values.tsv", sep="\t", index=False)

Note: I have launched the evaluation process with the random data and it kinda worked, but got out of memory after running for like 4 hours. It should work with that data, but I have not completely verified that.

Expected behavior

I expect the code in the ARES repo to follow the description in the paper, allowing users to evaluate real RAG configurations instead of working only with demo datasets that incorporate additional columns. These additional columns are not described as required in the paper, nor does it seem good practice to force users to add them to their datasets filled with random fake data.

None of the tutorials work

Hi, the framework and paper look very promising, but, unfortunately, I've been unable to get any of the tutorials to work. Neither in a Colab notebook nor locally on my mac.

When I try with a Colab notebook, I get the same error as #46

When I try locally on my mac, I cannot even import ares

>>> from ares import ARES

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/<placeholder>/Development/ARES/ares/__init__.py", line 1, in <module>
    from .ares import ARES
  File "/Users/<placeholder>/Development/ARES/ares/ares.py", line 3, in <module>
    from .rag_scoring import rag_scoring_config
  File "/Users/<placeholder>/Development/ARES/ares/rag_scoring.py", line 1, in <module>
    from ares.RAG_Automatic_Evaluation.LLMJudge_RAG_Compared_Scoring import begin
  File "/Users/<placeholder>/Development/ARES/ares/RAG_Automatic_Evaluation/LLMJudge_RAG_Compared_Scoring.py", line 47, in <module>
    from ares.RAG_Automatic_Evaluation.Evaluation_Functions import (
  File "/Users/<>/Development/ARES/ares/RAG_Automatic_Evaluation/Evaluation_Functions.py", line 24, in <module>
    from vllm import LLM
ModuleNotFoundError: No module named 'vllm'

vllm requires linux - https://docs.vllm.ai/en/latest/getting_started/installation.html

Therefore, sadly, the nice-looking documentation is quite deceiving, as the ARES lib is utterly unusable at the moment.

New README file instructions are incorrect

I am following along the instructions in the new README.md and they don't work as expected.

Note: I have installed ARES using the instructions at https://ares-ai.vercel.app/installation/, given that the Python package version has not been bumped to any new release. The previous codebase was 0.2.3; the current version on PyPI is still 0.2.3.

In the Quick Start 1 tutorial, these wget commands point to datasets that were deleted during the last update:

wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets/nq_few_shot_prompt_v1.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets_v2/nq/nq_labeled_output.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets_v2/nq/nq_unlabeled_output.tsv

returns error 404 for all files:

--2024-04-23 00:39:10--  https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets_v2/nq/nq_unlabeled_output.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2024-04-23 00:39:10 ERROR 404: Not Found.

When executing the ues_idp block for the first time, a ModuleNotFoundError is raised: the vLLM package is missing, so it has to be installed manually.

In Step 2 (synthetic dataset generation), document_filepath is expected to be a list, but a str is passed. The synthetic_queries_filename parameter is incorrect; the correct name is synthetic_queries_filenames and it is of type list, not str.

File nq_few_shot_prompt_for_synthetic_query_generation.tsv under examples only has Query and Document as columns. Running the synthetic dataset generation code returns: KeyError: 'Context_Relevance_Label' as that column is missing from the file.

In Step 3, the path to the training dataset should be data/output/synthetic_queries_1.tsv, as in the previous code block; data/ is missing at the beginning of the path.

Parameter 'training_dataset' for classifier_model is expected to be of type list, received str instead.
Parameter 'validation_set' for classifier_model is expected to be of type list, received str instead.
Parameter 'label_column' for classifier_model is expected to be of type list, received str instead.

There might be more errors once I am able to run the code, but I've not been able to generate the synthetic dataset using flan because of the incorrect few-shot prompt file.

strong negative generation

In the paper, there is a strong negative generation method to construct negative samples for LLM judge training, but I can't find any code for this in the repo. Is it actually not used to produce the final results in the paper?

[bug] error during import - vLLM not imported.

Error could be reproduced on google colab, but also occurs on different environments.

  1. pip install ares-ai, then restart the session (Google Colab)
  2. from ares import ARES (from the README)
    Gives:
vLLM not imported.
ImportError                               Traceback (most recent call last)
[<ipython-input-2-6e6f35634c3a>](https://localhost:8080/#) in <cell line: 1>()
----> 1 from ares import ARES

3 frames
[/usr/local/lib/python3.10/dist-packages/ares/__init__.py](https://localhost:8080/#) in <module>
----> 1 from .ares import ARES

[/usr/local/lib/python3.10/dist-packages/ares/ares.py](https://localhost:8080/#) in <module>
      1 from .synthetic_generator import synthetic_generator_config
      2 from .binary_classifier import binary_classifer_config
----> 3 from .rag_scoring import rag_scoring_config
      4 from .ues_idp import ues_idp_config
      5 from .kilt_filter import KILT_dataset_process

[/usr/local/lib/python3.10/dist-packages/ares/rag_scoring.py](https://localhost:8080/#) in <module>
----> 1 from ares.RAG_Automatic_Evaluation.LLMJudge_RAG_Compared_Scoring import begin
      2 from ares.RAG_Automatic_Evaluation.LLMJudge_RAG_Compared_Scoring import filter_dataset
      3 from ares.RAG_Automatic_Evaluation.LLMJudge_RAG_Compared_Scoring import preprocess_data
      4 from ares.RAG_Automatic_Evaluation.LLMJudge_RAG_Compared_Scoring import load_api_model
      5 from ares.RAG_Automatic_Evaluation.LLMJudge_RAG_Compared_Scoring import load_tokenizer_and_model

[/usr/local/lib/python3.10/dist-packages/ares/RAG_Automatic_Evaluation/LLMJudge_RAG_Compared_Scoring.py](https://localhost:8080/#) in <module>
     45 
     46 from ares.RAG_Automatic_Evaluation.ppi import clt_iid, binomial_iid, pp_mean_iid_asymptotic
---> 47 from ares.RAG_Automatic_Evaluation.Evaluation_Functions import (
     48     calculate_accuracy, few_shot_context_relevance_scoring,
     49     few_shot_answer_faithfulness_scoring, few_shot_answer_relevance_scoring,

ImportError: cannot import name 'few_shot_context_relevance_scoring_vllm' from 'ares.RAG_Automatic_Evaluation.Evaluation_Functions' (/usr/local/lib/python3.10/dist-packages/ares/RAG_Automatic_Evaluation/Evaluation_Functions.py)

---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------
