microsoft / rag-experiment-accelerator Goto Github PK

The RAG Experiment Accelerator is a versatile tool designed to expedite and facilitate the process of conducting experiments and evaluations using Azure Cognitive Search and RAG pattern.

Home Page: https://github.com/microsoft/rag-experiment-accelerator

License: Other

Python 88.81% HTML 5.85% Dockerfile 0.06% Shell 0.21% Bicep 4.38% Makefile 0.69%

acs chunking embedding rag evaluation experiment information-retrieval openai azure genai

rag-experiment-accelerator's People

Contributors

Stargazers

Watchers

Forkers

gmh5225 luc4t helder-daiha ai-bassem hooriya535 retaiba-rachid shivam-51 qlycool kpkool eltociear tomconte mohanajuhi166 haywardvb laranguyen811 1daste091 operaha ssmanehs kohnnness pkt1583 christoph-gl mvbugge guybartal shadiwodi sobhangat5 amoslem397 fr4nc3 rayjue cognitive-automation-labs sarvan24 guilhermelia akashtalole jfontestad sergiorru sjuratov maoyuexin kmeleonservices lovinggracem pawankarki23 zihuq-msft merico34 letemptt

rag-experiment-accelerator's Issues

Move init_openai and rest credentials to config

Move credentials check and initialization to config, rather than randomly all over the code

Add prompt templates for Healthcare industry use cases

Replace NLTK with Spacy

Config files that need to be updated and instructions to do that

When running 02_qa_generation locally OPENAI_API_KEY and other ENV Variables are not available

Even when the .env is populated, the following error is raised when running 02_qa_generation.py locally

    from ingest_data.acs_ingest import generate_qna
  File "/Users/shanepeckham/sources/rag-experiment-accelerator/ingest_data/acs_ingest.py", line 9, in <module>
    from llm.prompt_execution import generate_response
  File "/Users/shanepeckham/sources/rag-experiment-accelerator/llm/prompt_execution.py", line 8, in <module>
    openai.api_key = os.environ['OPENAI_API_KEY']
  File "/Users/shanepeckham/opt/miniconda3/envs/rag-test/lib/python3.10/os.py", line 680, in __getitem__
    raise KeyError(key) from None
KeyError: 'OPENAI_API_KEY'

Update `acs_ingest.py` to incorporate any new types of prompts

Extend readme with more details for the first setup

Readme:
Replace
python -m pip install -r requirements.txt
with
python -m pip install -f requirements.txt

Shall we also note that on fresh machine it'll trigger error:
"Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools"

Minimal selection:

Under the "Workloads" tab, check the box for "Desktop development with C++."
Under the "Individual components" tab, make sure the following components are selected:
"Windows 10 SDK"
"MSVC v142 - VS 2019 C++ x64/x86 build tools"
"C++ ATL for latest v142 build tools (x86 & x64)"
"C++ MFC for latest v142 build tools (x86 & x64)"

Add test coverage to the core functions

Tasks

Beta Give feedback

01_index.py: Add tests for search index creation (create_acs_index) across different variants (e.g. search_config.json settings, semantics, chunk sizes, overlap sizes, analyzers, etc.) #253

engfundamentals
01_index.py: Add tests for ingestion and index creation for different entities (e.g. keywords, summary, title, etc.) #254

engfundamentals
03_query.py: Add tests to validate against different search types #255

engfundamentals
03_query.py: Add tests to validate response evaluation output params (e.g expected vs actual) #256

engfundamentals
Add validation tests for config parameters (e.g. chunk size limits, etc) #257

0 of 1

engfundamentals
02_qa_generation.py: Add validation tests generation for different Q&A prompt instructions. #258

engfundamentals
Options

logger should show output to console as well

Put all the runtime generated artifacts into separate folder

Files like generated_index_names and other artifacts created in runtime appear as a difference on commit.
Let's move them all to a separate folder and gitignore it.

Configure spacy to be downloaded only once

Update the files that are downloading spacy and replace that with spacy download to be performed only once for the project.

Check existence of models on init openai

          You could also check existence of models at this stage. It can be done like this

from azure.core.exceptions import ResourceNotFoundError
from azure.ai.openai import OpenAIApiClient

client = OpenAIApiClient()

try:
deployment = client.deployments.get("your-deployment-name")
print("Deployment exists.")
except ResourceNotFoundError:
print("Deployment does not exist.")

Originally posted by @WVadim in #58 (comment)

Move OPENAI_EMBEDDING_DEPLOYED_MODEL to config

Currently OPENAI_EMBEDDING_DEPLOYED_MODEL is declared in .env file, while model used for chat completions is declared in config file. Anticipating more models in the future, it makes sense to move all model names into config under section "nlp_models" or something similar.

Action required: migrate or opt-out of migration to GitHub inside Microsoft

Migrate non-Open Source or non-External Collaboration repositories to GitHub inside Microsoft

In order to protect and secure Microsoft, private or internal repositories in GitHub for Open Source which are not related to open source projects or require collaboration with 3rd parties (customer, partners, etc.) must be migrated to GitHub inside Microsoft a.k.a GitHub Enterprise Cloud with Enterprise Managed User (GHEC EMU).

Action

✍️ Please RSVP to opt-in or opt-out of the migration to GitHub inside Microsoft.

❗Only users with admin permission in the repository are allowed to respond. Failure to provide a response will result to your repository getting automatically archived.🔒

Instructions

Reply with a comment on this issue containing one of the following optin or optout command options below.

✅ Opt-in to migrate

@gimsvc optin --date <target_migration_date in mm-dd-yyyy format>

Example: @gimsvc optin --date 03-15-2023

❌ Opt-out of migration

@gimsvc optout --reason <staging|collaboration|delete|other>

Example: @gimsvc optout --reason staging

Options:

staging : This repository will ship as Open Source or go public

collaboration : Used for external or 3rd party collaboration with customers, partners, suppliers, etc.

delete : This repository will be deleted because it is no longer needed.

other : Other reasons not specified

Need more help? 🖐️

Email [email protected]. ✉️
Post your questions in GitHub inside Microsoft Team in Microsoft Teams. 🗨️

Replace prints with proper logging

Tasks

Beta Give feedback

Introduce logging
Add progress tracking for iterative processes
Options

Resources that need to be deployed, environment variables in .env file

Add prompt templates for Healthcare industry use cases

Update `04_evaluation.py` to run with config class

Update 04_evaluation.py to run with config class instead of reading from config file directly.
Refer to 03_qa_generation.py

Environmental variables not accessibly when code is run from pyCharm

Place load_dotenv() before the imports of the code that references the env variables

Enable Blob Store and HTML pages data connectors to ingest data

Possible data connectors:

HTML web pages
Azure Blob Storage
Sharepoint
md from repos

Validate existence of models and keys

We should validate existence of all models and keys required for a particular step before executing the step. At the moment, in step 01_Index.py of model chat_model_name does not exist, we will build a bunch of embeddings only to fail at the last line of the script.
This logic should also be propagated to other scripts, in order to save user's resources and time

Running each of the scripts. Inputs and outputs required to run each of those. What to expect

Add markdown loading

Add code to load markdown files

Prompt Engineering within the RAG Pattern

Should include prompt templates within the RAG Experiment Accelerator to allow the user to experiment.
Current prompt templates are located in llm\prompts.py

Tasks

Beta Give feedback

Add some instructions in readme to let customers know they should add pdfs in data folder then run 01_Index.py?

I encountered such error when running 01_Index.py file following the guidance in readme.md. After some debugging, I figured out it is because the data folder is empty.
Can you pls add some instructions in readme to let customers know they should add pdfs in data folder then run 01_index.py?

Traceback (most recent call last):
File "C:\Users\yijunzhang\Documents\code_repo\rag-experiment-accelerator\01_Index.py", line 72, in
upload_data(data_load,service_endpoint,index_name,key, dimension, chat_model_name, temperature)
File "C:\Users\yijunzhang\Documents\code_repo\rag-experiment-accelerator\ingest_data\acs_ingest.py", line 46, in upload_data
results = search_client.upload_documents(documents)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\yijunzhang\Anaconda3\envs\fhl\Lib\site-packages\azure\search\documents_search_client.py", line 543, in upload_documents
results = self.index_documents(batch, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\yijunzhang\Anaconda3\envs\fhl\Lib\site-packages\azure\core\tracing\decorator.py", line 76, in wrapper_use_tracer
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\yijunzhang\Anaconda3\envs\fhl\Lib\site-packages\azure\search\documents_search_client.py", line 641, in index_documents
return self._index_documents_actions(actions=batch.actions, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\yijunzhang\Anaconda3\envs\fhl\Lib\site-packages\azure\search\documents_search_client.py", line 649, in _index_documents_actions
batch_response = self._client.documents.index(batch=batch, error_map=error_map, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\yijunzhang\Anaconda3\envs\fhl\Lib\site-packages\azure\core\tracing\decorator.py", line 76, in wrapper_use_tracer
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\yijunzhang\Anaconda3\envs\fhl\Lib\site-packages\azure\search\documents_generated\operations_documents_operations.py", line 1264, in index
raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: (MissingRequiredParameter) The request is invalid. Details: actions : No indexing actions found in the request. Please include between 1 and 32000 indexing actions in your request.
Code: MissingRequiredParameter
Message: The request is invalid. Details: actions : No indexing actions found in the request. Please include between 1 and 32000 indexing actions in your request.
Exception Details: (MissingIndexDocumentsActions) No indexing actions found in the request. Please include between 1 and 32000 indexing actions in your request. Parameters: actions
Code: MissingIndexDocumentsActions
Message: No indexing actions found in the request. Please include between 1 and 32000 indexing actions in your request. Parameters: actions

Support text files for ingestion

Support txt, md, HTML and PDF files for ingestion

Include additional metrics into the evaluators

(i.e. entitlement)

Add versioning in artifacts naming

Introduce schema to name all the artifacts generated:

case name
parameters combination as a prefix

Fix runtime warnings

Support OpenAI models directly

The Accelerator currently supports models from Azure, but we should allow access to models provided directly from OpenAI.

In readme.md, give 2 examples of .env for both openai customers and azure openai customers.

For openai, you dont have to set endpoint, just key and org. For azure openai, we have endpoint, etc.

Can you do some modifications like this to provide 2 .env examples, one for openai, one for azure openai.

Augmentation of ingested data

Current implementation captures the "title" and "summary", will need to include keywords, keyphrases, and entities. Need to edit the prompts.py and acs_ingest.py files to request that the LLM generate keywords and entities.

RAG Pattern for Multi-Lingual Scenarios

Azure Cognitive Search has skillsets for language detection/processing. Exploration will be required to determine the best implementation. Should experiment with German, Italian, and English as languages to be tested, as these are currently being used in an active customer engagement.

Tasks

Beta Give feedback

Determine if the default Standard Lucene language analyzer is sufficient
Determine support guidance for index creation and querying (e.g. blended or language-specific indexes)
Add LanguageDetectionSkill support
Create language analyzer settings analyzers, tokenizers, token_filters, etc to search_config.json and update SearchIndexClient settings
Set data ingestion limits/constraints on chunk size to adhere to maximum record size (i.e. 50k characters) - see search_config.json.
Options

Fix first time installation issues

No module named 'azure.core'

Reading data from multiple file formats to include content into index

Need to standardize the process of loading and returning data ingested into the tool.

Prioritize MD files.
Next should be HTML

File formats to consider:

PDF
HTML
MD
docx

Tasks

Beta Give feedback

Enable Prompt-like Interaction with the Accelerator

Interactive tool which helps user choose intent, query type, understand their data and possible label sets, focus on different relevant parts of the documents to generate different results to help user choose what makes most sense.

Can be summarized as content understanding and intent understanding:

Understanding source of data, potential sources of data and type of data
Leveraging CMS sources including metadata
Questions to identify what content processing would work best
Questions to identify intent - Comparison intent, search intent, opinion intent

Add function docstrings

QA generation - JSON not valid bug

Currently running the 02_qa_generation.py sometimes provides an invalid JSON output.
Update or fix the QA generation prompt in prompts.py to make sure that only a valid JSON is created as a result of running the 02_qa_generation.py script.

Add reading of the data from HTML files

Add reading from HTML files and splitting them into parts.

Load Documents into the Accelerator in parallel

Parallel jobs need to be enabled within the Accelerator to accept the ingestion and processing of multiple documents/artifacts.

Potential pattern to be applied would be via the implementation of Azure Machine Learning pipelines.

Tasks

Beta Give feedback

No tasks being tracked yet.

Options

Make conda environment file and add versions

Currently we are expecting some version of python and fixed versions of libraries, but do not specify it anywhere except README.
What could be done, is to create environment.yaml for conda, to automatically create conda env with everything required, including preloaded models and versioned libraries as well as specify python version

Generating Summary Q&As for entire documents

Remove PII from Git commit history

For example, the eval_data.jsonl file has PII (emails, phone numbers, job titles, etc.) related to personnel who work for a customer. All files need to be scanned for PII and sanitized.

Implement Azure OpenAI Embeddings Models

Create a config file associated with the generate_embeddings.py file that can provide the appropriate sizes and chunks for Azure OpenAI embeddings models.

The definition for embedding_dimension should be automatically mapped to the selected embeddings model.

Also update documentation with the relevant instructions for users.

Background:
Azure Open AI offers embeddings that can be used to do search and analyze complex documents. Some examples of complex documents are:

Legal contracts
Medical records
Scientific articles

Embeddings are numerical representations of words and phrases that capture the meaning and context of the text. These embeddings can be used to build powerful search and analysis tools that can extract insights from large volumes of text data.

Everything should be linked to the search configuration.

https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/understand-embeddings

Documentation on "Task-specific knowledge enrichment" can assist in the implementation of embeddings models.

Documentation

Add documentation to explain:

Tasks

Beta Give feedback

Resources that need to be deployed, environment variables in .env file #100

documentation
Config files that need to be updated and instructions to do that #101

documentation
Running each of the scripts. Inputs and outputs required to run each of those. What to expect #102

documentation
In readme.md, give 2 examples of .env for both openai customers and azure openai customers. #19

documentation
Update readme.md to meet criteria in MSFT Style Guide #129

documentation
Update variables in the .env file to specify Azure ML #130

documentation
Options

Implement Search Evaluator while querying

Need to evaluate the search response from Azure Cognitive Search to determine its information retrieval accuracy. A score for response relevancy should be included as part of the tooling for this section.

https://learn.microsoft.com/en-us/azure/search/search-pagination-page-layout

https://learn.microsoft.com/en-us/azure/search/index-similarity-and-scoring

Search metrics that should be included:

precision@k
recall@k
map@k

Tasks

Beta Give feedback

evaluate search score format
persist search score
display search score in mlflow
Options

microsoft / rag-experiment-accelerator Goto Github PK

rag-experiment-accelerator's People

Contributors

Stargazers

Watchers

Forkers

rag-experiment-accelerator's Issues

Tasks

Migrate non-Open Source or non-External Collaboration repositories to GitHub inside Microsoft

Action

Instructions

Need more help? 🖐️

Tasks

Tasks

Tasks

Tasks

Tasks

Tasks

Tasks

Recommend Projects

Recommend Topics

Recommend Org