
graphrag's Introduction

GraphRAG

👉 Use the GraphRAG Accelerator solution
👉 Microsoft Research Blog Post
👉 Read the docs
👉 GraphRAG Arxiv

Overview

The GraphRAG project is a data pipeline and transformation suite designed to extract meaningful, structured data from unstructured text using the power of LLMs.

To learn more about GraphRAG and how it can be used to enhance your LLM's ability to reason about your private data, please visit the Microsoft Research Blog Post.

Quickstart

To get started with the GraphRAG system we recommend trying the Solution Accelerator package. This provides a user-friendly end-to-end experience with Azure resources.

Repository Guidance

This repository presents a methodology for using knowledge graph memory structures to enhance LLM outputs. Please note that the provided code serves as a demonstration and is not an officially supported Microsoft offering.

⚠️ Warning: GraphRAG indexing can be an expensive operation. Please read all of the documentation to understand the process and costs involved, and start small.

Diving Deeper

Prompt Tuning

Using GraphRAG with your data out of the box may not yield the best possible results. We strongly recommend fine-tuning your prompts by following the Prompt Tuning Guide in our documentation.

Responsible AI FAQ

See RAI_TRANSPARENCY.md

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Privacy

Microsoft Privacy Statement

graphrag's Issues

Allow JSONL Input Type

As an alternative to CSV, we should allow structured input from JSONL files, using configs similar to CSV mode to drill down into the text, title, timestamp, etc. fields.
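
For illustration, a minimal sketch of what the config might look like, mirroring the CSV keys (the jsonl file_type and the exact key names are assumptions, not an implemented feature):

```yaml
# hypothetical settings.yaml sketch for JSONL input
input:
  file_type: jsonl            # assumption: new type alongside csv/text
  file_pattern: ".*\\.jsonl$"
  text_column: text           # e.g. {"text": "...", "title": "Doc 1", "timestamp": "2024-01-01"}
  title_column: title
  timestamp_column: timestamp
```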

IO error in lancedb write operation

I am facing an IOError in the lancedb write operation, in the store_entity_semantic_embeddings call during local search (https://github.com/microsoft/graphrag/blob/main/examples_notebooks/local_search.ipynb).

I was able to reproduce the error with the following code:

import lancedb

# connect to a local on-disk LanceDB database and write a small sample table
uri = "data/sample-lancedb"
db = lancedb.connect(uri)
data = [
    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
]
tbl = db.create_table("my_table", data=data)

Is there any standard way to fix this that I might have missed? Any authentication I need to do with lancedb?
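
For what it's worth, a minimal sketch for checking whether the write actually landed (standard lancedb calls; a local on-disk database like this normally needs no authentication):

```python
import lancedb

# reconnect and inspect the database on disk
db = lancedb.connect("data/sample-lancedb")
print(db.table_names())    # "my_table" should appear if the write succeeded
tbl = db.open_table("my_table")
print(tbl.to_pandas())     # dump the stored rows for inspection
```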

Question: Can one add more documents without re-processing everything

The question sort of says it all.

I ran graphrag on a single .txt file in the input directory. I then added a new file and tried running the indexing with --resume from the original run... It stated that the 2 documents were processed, but obviously did nothing with the new one.

Can new documents be added to the graph without having to re-run everything from scratch?

How to visualize the graph?

After embedding the dataset and creating the graph, is there currently any way to visualize the data, as it is shown in the blog post?
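
One approach, assuming your run wrote GraphML snapshots (e.g. with the graphml snapshot option enabled in settings.yaml), is to load the artifact with networkx; a minimal sketch (the file path is illustrative):

```python
import networkx as nx
import matplotlib.pyplot as plt

# load whichever .graphml artifact your indexing run produced
G = nx.read_graphml("output/artifacts/summarized_graph.graphml")

# quick-and-dirty layout; dedicated tools like Gephi give nicer results
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, node_size=30, width=0.3, with_labels=False)
plt.show()
```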

Error in create_final_text_units workflow when using Vector Database

Problem Description

In line 144 of the create_final_text_units.py workflow, the verb attempts to load the text_embedding column, which does not exist in the dataframe when a vector database is defined.

Solution

Copy the logic defined in the create_final_entities.py workflow, where the verb is conditioned on whether the embeddings are in a dataframe or a vector database.

See line 27 for an example of how to detect a vector store.
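
A minimal sketch of the proposed conditioning (names are illustrative, not the actual verb code):

```python
import pandas as pd

# hypothetical helper: only select the embedding column when embeddings
# actually live in the dataframe rather than in a configured vector store
def select_text_unit_columns(df: pd.DataFrame, using_vector_store: bool) -> pd.DataFrame:
    columns = ["id", "text", "n_tokens", "document_ids"]
    if not using_vector_store and "text_embedding" in df.columns:
        columns.append("text_embedding")
    return df[columns]
```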

Missing documentation settings for query

Documentation for the query environment variables is missing two variables that are required when running queries with Azure OpenAI.

GRAPHRAG_LLM_DEPLOYMENT_NAME and GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME are required variables when GRAPHRAG_LLM_TYPE=azure_openai_chat and GRAPHRAG_EMBEDDING_TYPE=azure_openai_embedding respectively.

When using Azure OpenAI, several more query env variables are required. I would propose reorganizing the documentation page to match this page, where the table layout makes it easier to understand which variables are needed for use with AOAI.
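
For reference, a sketch of the query-time variables in question (the two deployment-name variables are the missing ones; the placeholder values are illustrative):

```
GRAPHRAG_LLM_TYPE=azure_openai_chat
GRAPHRAG_LLM_DEPLOYMENT_NAME=<your-chat-deployment>
GRAPHRAG_EMBEDDING_TYPE=azure_openai_embedding
GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME=<your-embedding-deployment>
```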

[Ollama][Other] GraphRAG OSS LLM community support

What I tried:
I ran this on my local GPU and tried pointing api_base at a model served by Ollama in the settings.yaml file.
model: llama3:latest
api_base: http://localhost:11434/v1 #https://.openai.azure.com
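
For context, a fuller sketch of the llm block as I would expect it in settings.yaml (the surrounding keys are assumptions based on the default template; Ollama ignores the API key but one must be set):

```yaml
llm:
  api_key: ollama                      # placeholder; not validated by Ollama
  type: openai_chat                    # use the OpenAI-compatible endpoint
  model: llama3:latest
  api_base: http://localhost:11434/v1
```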

Error:
graphrag.index.reporting.file_workflow_callbacks INFO Error Invoking LLM details={'input': '\n-Goal-\nGiven a text document that is pot....}

Commands:
#initialize
python -m graphrag.index --init --root .

#index
python -m graphrag.index --root .

#query
python -m graphrag.query --root . --method global "query"

#query
python -m graphrag.query --root . --method local "query"

Does graphrag support other LLM-hosting server frameworks?

Error using vllm OpenAI server

INFO: 172.16.80.35:48532 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
^^^^^^^^^^^^^^^^^^^
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 89, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/entrypoints/openai/serving_chat.py", line 68, in create_chat_completion
sampling_params = request.to_sampling_params()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/entrypoints/openai/protocol.py", line 157, in to_sampling_params
return SamplingParams(
^^^^^^^^^^^^^^^
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/sampling_params.py", line 157, in __init__
self._verify_args()
File "/home/bigdata/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/sampling_params.py", line 172, in _verify_args
if self.n < 1:
^^^^^^^^^^
TypeError: '<' not supported between instances of 'NoneType' and 'int'
INFO 07-04 07:24:41 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
[... the same idle metrics line repeats every 10 seconds until the next request ...]
INFO: 172.16.80.35:53354 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[... identical traceback to the one above ...]
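
The traceback shows SamplingParams receiving n=None from the request, so _verify_args blows up on self.n < 1. As a quick check (an assumption, not a confirmed fix), you could call the vllm endpoint directly with n set explicitly and see whether the 500 disappears:

```python
from openai import OpenAI

# point the standard OpenAI client at the vllm server (base_url is illustrative)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="your-served-model",  # whatever model name vllm is serving
    messages=[{"role": "user", "content": "hello"}],
    n=1,                        # explicit n avoids the NoneType comparison
)
print(resp.choices[0].message.content)
```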

Bump scipy from 1.12.0 to 1.13.0

SciPy 1.13.0 fails because the triu import was moved or deprecated.
We need to find the alternative path or solution so we can upgrade to the latest version and remove the constraint pinning us to 1.12.
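
Assuming the failing import is scipy.linalg.triu (deprecated and then removed in the 1.13 line), numpy.triu is a drop-in replacement for dense arrays, so a guarded import may be all that's needed:

```python
import numpy as np

try:
    from scipy.linalg import triu  # removed in SciPy 1.13
except ImportError:
    triu = np.triu  # numpy equivalent for dense arrays
```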

Are the examples leaking into the final embeddings?

I can't tell whether they are leaking or not. If they are, please pick factual examples.


######################
-Examples-
######################
Example 1:

Entity_types: [person, technology, mission, organization, location]
Text:
while Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor's authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan's shared commitment to discovery was an unspoken rebellion against Cruz's narrowing vision of control and order.

Then Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. "If this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us."

The underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor's, a wordless clash of wills softening into an uneasy truce.

It was a small transformation, barely perceptible, but one that Alex noted with an inward nod. They had all been brought here by different paths
################
Output:
("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is a character who experiences frustration and is observant of the dynamics among other characters."){record_delimiter}
("entity"{tuple_delimiter}"Taylor"{tuple_delimiter}"person"{tuple_delimiter}"Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective."){record_delimiter}
("entity"{tuple_delimiter}"Jordan"{tuple_delimiter}"person"{tuple_delimiter}"Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device."){record_delimiter}
("entity"{tuple_delimiter}"Cruz"{tuple_delimiter}"person"{tuple_delimiter}"Cruz is associated with a vision of control and order, influencing the dynamics among other characters."){record_delimiter}
("entity"{tuple_delimiter}"The Device"{tuple_delimiter}"technology"{tuple_delimiter}"The Device is central to the story, with potential game-changing implications, and is revered by Taylor."){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Taylor"{tuple_delimiter}"Alex is affected by Taylor's authoritarian certainty and observes changes in Taylor's attitude towards the device."{tuple_delimiter}7){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Jordan"{tuple_delimiter}"Alex and Jordan share a commitment to discovery, which contrasts with Cruz's vision."{tuple_delimiter}6){record_delimiter}
("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"Jordan"{tuple_delimiter}"Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce."{tuple_delimiter}8){record_delimiter}
("relationship"{tuple_delimiter}"Jordan"{tuple_delimiter}"Cruz"{tuple_delimiter}"Jordan's commitment to discovery is in rebellion against Cruz's vision of control and order."{tuple_delimiter}5){record_delimiter}
("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"The Device"{tuple_delimiter}"Taylor shows reverence towards the device, indicating its importance and potential impact."{tuple_delimiter}9){completion_delimiter}
#############################
Example 2:

Entity_types: [person, technology, mission, organization, location]
Text:
They were no longer mere operatives; they had become guardians of a threshold, keepers of a message from a realm beyond stars and stripes. This elevation in their mission could not be shackled by regulations and established protocols; it demanded a new perspective, a new resolve.

Tension threaded through the dialogue of beeps and static as communications with Washington buzzed in the background. The team stood, a portentous air enveloping them. It was clear that the decisions they made in the ensuing hours could redefine humanity's place in the cosmos or condemn them to ignorance and potential peril.

Their connection to the stars solidified, the group moved to address the crystallizing warning, shifting from passive recipients to active participants. Mercer's latter instincts gained precedence; the team's mandate had evolved, no longer solely to observe and report but to interact and prepare. A metamorphosis had begun, and Operation: Dulce hummed with the newfound frequency of their daring, a tone set not by the earthly
#############
Output:
("entity"{tuple_delimiter}"Washington"{tuple_delimiter}"location"{tuple_delimiter}"Washington is a location where communications are being received, indicating its importance in the decision-making process."){record_delimiter}
("entity"{tuple_delimiter}"Operation: Dulce"{tuple_delimiter}"mission"{tuple_delimiter}"Operation: Dulce is described as a mission that has evolved to interact and prepare, indicating a significant shift in objectives and activities."){record_delimiter}
("entity"{tuple_delimiter}"The team"{tuple_delimiter}"organization"{tuple_delimiter}"The team is portrayed as a group of individuals who have transitioned from passive observers to active participants in a mission, showing a dynamic change in their role."){record_delimiter}
("relationship"{tuple_delimiter}"The team"{tuple_delimiter}"Washington"{tuple_delimiter}"The team receives communications from Washington, which influences their decision-making process."{tuple_delimiter}7){record_delimiter}
("relationship"{tuple_delimiter}"The team"{tuple_delimiter}"Operation: Dulce"{tuple_delimiter}"The team is directly involved in Operation: Dulce, executing its evolved objectives and activities."{tuple_delimiter}9){completion_delimiter}
#############################
Example 3:

Entity_types: [person, role, technology, organization, event, location, concept]
Text:
their voice slicing through the buzz of activity. "Control may be an illusion when facing an intelligence that literally writes its own rules," they stated stoically, casting a watchful eye over the flurry of data.

"It's like it's learning to communicate," offered Sam Rivera from a nearby interface, their youthful energy boding a mix of awe and anxiety. "This gives talking to strangers' a whole new meaning."

Alex surveyed his team, each face a study in concentration, determination, and not a small measure of trepidation. "This might well be our first contact," he acknowledged, "And we need to be ready for whatever answers back."

Together, they stood on the edge of the unknown, forging humanity's response to a message from the heavens. The ensuing silence was palpable, a collective introspection about their role in this grand cosmic play, one that could rewrite human history.

The encrypted dialogue continued to unfold, its intricate patterns showing an almost uncanny anticipation
#############
Output:
("entity"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"person"{tuple_delimiter}"Sam Rivera is a member of a team working on communicating with an unknown intelligence, showing a mix of awe and anxiety."){record_delimiter}
("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is the leader of a team attempting first contact with an unknown intelligence, acknowledging the significance of their task."){record_delimiter}
("entity"{tuple_delimiter}"Control"{tuple_delimiter}"concept"{tuple_delimiter}"Control refers to the ability to manage or govern, which is challenged by an intelligence that writes its own rules."){record_delimiter}
("entity"{tuple_delimiter}"Intelligence"{tuple_delimiter}"concept"{tuple_delimiter}"Intelligence here refers to an unknown entity capable of writing its own rules and learning to communicate."){record_delimiter}
("entity"{tuple_delimiter}"First Contact"{tuple_delimiter}"event"{tuple_delimiter}"First Contact is the potential initial communication between humanity and an unknown intelligence."){record_delimiter}
("entity"{tuple_delimiter}"Humanity's Response"{tuple_delimiter}"event"{tuple_delimiter}"Humanity's Response is the collective action taken by Alex's team in response to a message from an unknown intelligence."){record_delimiter}
("relationship"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"Intelligence"{tuple_delimiter}"Sam Rivera is directly involved in the process of learning to communicate with the unknown intelligence."{tuple_delimiter}9){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"First Contact"{tuple_delimiter}"Alex leads the team that might be making the First Contact with the unknown intelligence."{tuple_delimiter}10){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Humanity's Response"{tuple_delimiter}"Alex and his team are the key figures in Humanity's Response to the unknown intelligence."{tuple_delimiter}8){record_delimiter}
("relationship"{tuple_delimiter}"Control"{tuple_delimiter}"Intelligence"{tuple_delimiter}"The concept of Control is challenged by the Intelligence that writes its own rules."{tuple_delimiter}7){completion_delimiter}
#############################

Update error message when TPM thresholds have been met

Currently, if the TPM quota is exceeded while indexing data, an error message Error Invoking LLM gets printed in the logs. With throttling and rate-limiting in use, this particular message gives users the false sense that an error has occurred and is not being managed. It would be more informative if the message were a warning along the lines of LLM rate limit exceeded... waiting and will try again. Once max_retries has been met, an error message should be logged.
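
A minimal sketch of the proposed behavior (names are illustrative, not the actual callback code):

```python
import logging

log = logging.getLogger(__name__)

def report_rate_limit(attempt: int, max_retries: int, wait_s: float) -> None:
    # warn while retries remain; only log an error once they are exhausted
    if attempt < max_retries:
        log.warning(
            "LLM rate limit exceeded... waiting %.0fs and will try again (attempt %d/%d)",
            wait_s, attempt, max_retries,
        )
    else:
        log.error("LLM rate limit exceeded and max_retries (%d) reached", max_retries)
```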

Provide an option to fail fast when LLM calls timeout or are exhausted

The pipeline yields "corrupted" dataframes when endpoint TPM thresholds have been exceeded too many times.
We should provide options to control the failure behavior of LLM-based operations, so we can fail fast and surface the issue instead of continuing and generating incorrect dataframes.

Bump ruff from 0.2.2 to 0.3.5

The Dependabot version update for ruff fails, since a usage of Random fails against a newly introduced rule.

Investigate this and find the appropriate solution so we can upgrade.

Add LLM Telemetry Configuration

  • Add config for wiring in Azure Insights
  • Report on some high-level stats in LLM jobs and push them out to stats.json (e.g. generated tokens per minute, error rate, mean request time, request time tp80, etc.); a hypothetical shape is sketched below
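
A hypothetical shape for the proposed additions to stats.json (every field name here is illustrative):

```json
{
  "llm": {
    "generated_tokens_per_minute": 1250,
    "error_rate": 0.02,
    "mean_request_time_ms": 840,
    "request_time_tp80_ms": 1320
  }
}
```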

Helloworld walkthrough failing on entity graph creation

When running create_base_extracted_entities, the entity extraction seems to work fine (per checking the cache), but when the merge_graph stage runs, it fails silently.

Other than a bunch of "error invoking LLM" from having too many threads, these are the only interesting log file entries:

{"type": "error", "data": "Error Invoking LLM", "stack": "Traceback (most recent call last):\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\llm\\base\\base_llm.py\", line 57, in _invoke\n    output = await self._execute_llm(input, **kwargs)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\llm\\openai\\openai_chat_llm.py\", line 55, in _execute_llm\n    completion = await self.client.chat.completions.create(\n                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\openai\\resources\\chat\\completions.py\", line 1289, in create\n    return await self._post(\n           ^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\openai\\_base_client.py\", line 1805, in post\n    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\openai\\_base_client.py\", line 1503, in request\n    return await self._request(\n           ^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\openai\\_base_client.py\", line 1599, in _request\n    raise self._make_status_error_from_response(err.response) from None\nopenai.RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2024-02-15-preview have exceeded token rate limit of your current OpenAI S0 pricing tier. Please retry after 48 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.'}}\n", "source": "Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2024-02-15-preview have exceeded token rate limit of your current OpenAI S0 pricing tier. Please retry after 48 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.'}}", "details": {"input": "\n-Goal-\nGiven a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.\n\n-Steps-\n1. Identify all entities. For each identified entity, extract the following information:\n- entity_name: Name of the entity, capitalized\n- entity_type: One of the following types: [person, location, organization, document, event, relationship, object]\n- entity_description: Comprehensive description of the entity's attributes and activities\nFormat each entity as (\"entity\"<|><entity_name><|><entity_type><|><entity_description>\n\n2. 
From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.\nFor each pair of related entities, extract the following information:\n- source_entity: name of the source entity, as identified in step 1\n- target_entity: name of the target entity, as identified in step 1\n- relationship_description: explanation as to why you think the source entity and the target entity are related to each other\n- relationship_strength: an integer score between 1 to 10, indicating strength of the relationship between the source entity and target entity\n\nFormat each relationship as (\"relationship\"<|><source_entity><|><target_entity><|><relationship_description><|><relationship_strength>)\n\n3. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **##** as the list delimiter.\n\n4. When finished, output <|COMPLETE|>\n\n-Examples-\n######################\n\nExample 1:\n\nentity_types: [person, location, organization, document, event, relationship, object]\ntext:\n were sonorous and melancholy. Occasionally they were fantastic and cheerful. Clearly they reflected the thoughts which possessed him, but whether the music aided those thoughts, or whether the playing was simply the result of a whim or fancy was more than I could determine. I might have rebelled against these ex- asperating solos had it not been that he usually terminated them by playing in quick succession a whole series of my favourite airs as a slight com- pensation for the trial upon my patience. \nDuring the first week or so we had no callers, and I had begun to think that my companion was as friendless a man as I was myself. Presently, however, I found that he had many acquaintances, and those in the most different classes of society. There was one little sallow rat-faced, dark-eyed fel- low who was introduced to me as Mr. Lestrade, and who came three or four times in a single week. One morning a young girl called\n------------------------\noutput:\n**Entities:**\n\n(\"entity\"{\"tuple_delimiter\"}\"Mr. Lestrade\"{\"tuple_delimiter\"}\"person\"{\"tuple_delimiter\"}\"A little sallow rat-faced, dark-eyed fellow who visited three or four times in a single week.\")\n\n\n**Relationships:**\n\n(\"relationship\"{\"tuple_delimiter\"}\"Mr. Lestrade\"{\"tuple_delimiter\"}\"companion\"{\"tuple_delimiter\"}\"Mr. Lestrade is an acquaintance of the narrator's companion.\"{\"tuple_delimiter\"}6)\n\n<|COMPLETE|>\n#############################\n\n\nExample 2:\n\nentity_types: [person, location, organization, document, event, relationship, object]\ntext:\n and spread out the documents upon his knees. Then he lit his pipe and sat for some time smoking and turning them over. \n\"You never heard me talk of Victor Trevor?\" he asked. \"He was the only friend I made during the two years I was at college. I was never a very socia- ble fellow, Watson, always rather fond of moping in my rooms and working out my own little meth- ods of thought, so that I never mixed much with the men of my year. Bar fencing and boxing I had few athletic tastes, and then my line of study was quite distinct from that of the other fellows, so that we had no points of contact at all. Trevor was the \n\n\nonly man I knew, and that only through the acci- dent of his bull terrier freezing on to my ankle one morning as I went down to chapel. \n\"It was a prosaic way of forming a friendship, but it was effective. 
I was laid by the heels\n------------------------\noutput:\n**Entities:**\n\n(\"entity\"{\"tuple_delimiter\"}\"Victor Trevor\"{\"tuple_delimiter\"}\"person\"{\"tuple_delimiter\"}\"Victor Trevor is the only friend the narrator made during his two years at college. Their friendship began when Trevor's bull terrier bit the narrator's ankle.\"}\n\n(\"entity\"{\"tuple_delimiter\"}\"Watson\"{\"tuple_delimiter\"}\"person\"{\"tuple_delimiter\"}\"Watson is the person being spoken to by the narrator. He is a companion and confidant of the narrator.\"}\n\n(\"entity\"{\"tuple_delimiter\"}\"college\"{\"tuple_delimiter\"}\"location\"{\"tuple_delimiter\"}\"The college is the place where the narrator spent two years and met Victor Trevor.\"}\n\n(\"entity\"{\"tuple_delimiter\"}\"chapel\"{\"tuple_delimiter\"}\"location\"{\"tuple_delimiter\"}\"The chapel is the location where the narrator was heading when he encountered Victor Trevor's bull terrier.\"}\n\n(\"entity\"{\"tuple_delimiter\"}\"bull terrier\"{\"tuple_delimiter\"}\"object\"{\"tuple_delimiter\"}\"The bull terrier is the dog belonging to Victor Trevor that bit the narrator's ankle, leading to their friendship.\"}\n\n**Relationships:**\n\n(\"relationship\"{\"tuple_delimiter\"}\"Victor Trevor\"{\"tuple_delimiter\"}\"bull terrier\"{\"tuple_delimiter\"}\"The bull terrier belongs to Victor Trevor and was the catalyst for the friendship between Victor Trevor and the narrator.\"{\"tuple_delimiter\"}8)\n\n(\"relationship\"{\"tuple_delimiter\"}\"Victor Trevor\"{\"tuple_delimiter\"}\"college\"{\"tuple_delimiter\"}\"Victor Trevor attended the same college as the narrator, where they became friends.\"{\"tuple_delimiter\"}7)\n\n(\"relationship\"{\"tuple_delimiter\"}\"narrator\"{\"tuple_delimiter\"}\"college\"{\"tuple_delimiter\"}\"The narrator spent two years at the college, where he met Victor Trevor.\"{\"tuple_delimiter\"}7)\n\n(\"relationship\"{\"tuple_delimiter\"}\"narrator\"{\"tuple_delimiter\"}\"Victor Trevor\"{\"tuple_delimiter\"}\"Victor Trevor is the only friend the narrator made during his two years at college.\"{\"tuple_delimiter\"}9)\n\n(\"relationship\"{\"tuple_delimiter\"}\"narrator\"{\"tuple_delimiter\"}\"chapel\"{\"tuple_delimiter\"}\"The narrator was heading to the chapel when he encountered Victor Trevor's bull terrier.\"{\"tuple_delimiter\"}6)\n\n(\"relationship\"{\"tuple_delimiter\"}\"narrator\"{\"tuple_delimiter\"}\"Watson\"{\"tuple_delimiter\"}\"Watson is the person being spoken to by the narrator, indicating a close relationship.\"{\"tuple_delimiter\"}8)\n\n**<|COMPLETE|>**\n#############################\n\n\n\n-Real Data-\n######################\nentity_types: [person, location, organization, document, event, relationship, object]\ntext: - lish Sir Robert in a fair position in life. Both po- lice and coroner took a lenient view of the trans- action, and beyond a mild censure for the delay in registering the lady's decease, the lucky owner got away scatheless from this strange incident in a career which has now outlived its shadows and promises to end in an honoured old age. \n\n\nThe Adventure of the Retired Colourman \n\n\nThe Adventure of the Retired Colourman \n\n\n\nout?' \n\n\nherlock Holmes was in a melancholy and philosophic mood that morning. His alert practical nature was subject to such reactions. \n'Did you see him?\" he asked. \n'You mean the old fellow who has just gone \n\"Precisely.\" \n\"Yes, I met him at the door. \" \n\"What did you think of him?\" \n\"A pathetic, futile, broken creature.\" \n\"Exactly, Watson. 
Pathetic and futile. But is not all life pathetic and futile? Is not his story a micro- cosm of the whole? We reach. We grasp. And what is left in our hands at the end? A shadow. Or worse than a shadow \u2014 misery.\" \n\"Is he one of your clients?\" \n\"Well, I suppose I may call him so. He has been sent on by the Yard. Just as medical men occasion- ally send their incurables to a quack. They argue that they can do nothing more, and that whatever\n######################\noutput:"}}
{"type": "error", "data": "Entity Extraction Error", "stack": "Traceback (most recent call last):\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\index\\graph\\extractors\\graph\\graph_extractor.py\", line 118, in __call__\n    result = await self._process_document(text, prompt_variables)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\index\\graph\\extractors\\graph\\graph_extractor.py\", line 146, in _process_document\n    response = await self._llm(\n               ^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\llm\\openai\\json_parsing_llm.py\", line 34, in __call__\n    result = await self._delegate(input, **kwargs)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\llm\\openai\\openai_token_replacing_llm.py\", line 37, in __call__\n    return await self._delegate(input, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\llm\\openai\\openai_history_tracking_llm.py\", line 33, in __call__\n    output = await self._delegate(input, **kwargs)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\llm\\base\\caching_llm.py\", line 104, in __call__\n    result = await self._delegate(input, **kwargs)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\llm\\base\\rate_limiting_llm.py\", line 177, in __call__\n    result, start = await execute_with_retry()\n                    ^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\llm\\base\\rate_limiting_llm.py\", line 159, in execute_with_retry\n    async for attempt in retryer:\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\tenacity\\asyncio\\__init__.py\", line 166, in __anext__\n    do = await self.iter(retry_state=self._retry_state)\n         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\tenacity\\asyncio\\__init__.py\", line 153, in iter\n    result = await action(retry_state)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\tenacity\\_utils.py\", line 99, in inner\n    return call(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\tenacity\\__init__.py\", line 418, in exc_check\n    raise retry_exc.reraise()\n          ^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\tenacity\\__init__.py\", line 185, in reraise\n    raise self.last_attempt.result()\n          ^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Python311\\Lib\\concurrent\\futures\\_base.py\", line 449, in result\n    return self.__get_result()\n           ^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Python311\\Lib\\concurrent\\futures\\_base.py\", line 401, in __get_result\n    raise self._exception\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\llm\\base\\rate_limiting_llm.py\", line 165, in execute_with_retry\n    return await do_attempt(), start\n           ^^^^^^^^^^^^^^^^^^\n  File 
\"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\llm\\base\\rate_limiting_llm.py\", line 151, in do_attempt\n    await sleep_for(sleep_time)\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\llm\\base\\rate_limiting_llm.py\", line 147, in do_attempt\n    return await self._delegate(input, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\llm\\base\\base_llm.py\", line 53, in __call__\n    return await self._invoke(input, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\llm\\base\\base_llm.py\", line 57, in _invoke\n    output = await self._execute_llm(input, **kwargs)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\llm\\openai\\openai_chat_llm.py\", line 55, in _execute_llm\n    completion = await self.client.chat.completions.create(\n                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\openai\\resources\\chat\\completions.py\", line 1289, in create\n    return await self._post(\n           ^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\openai\\_base_client.py\", line 1805, in post\n    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\openai\\_base_client.py\", line 1503, in request\n    return await self._request(\n           ^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\openai\\_base_client.py\", line 1599, in _request\n    raise self._make_status_error_from_response(err.response) from None\nopenai.RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2024-02-15-preview have exceeded token rate limit of your current OpenAI S0 pricing tier. Please retry after 52 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.'}}\n", "source": "Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2024-02-15-preview have exceeded token rate limit of your current OpenAI S0 pricing tier. Please retry after 52 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.'}}", "details": {"doc_index": 0, "text": "ly premeditated, then the means of covering it are coolly premeditated also. I hope, therefore, that we are in the presence of a serious misconception.\" \n\"But there is so much to explain.\" \n\"Well, we shall set about explaining it. When once your point of view is changed, the very thing which was so damning becomes a clue to the truth. For example, there is this revolver. Miss Dunbar disclaims all knowledge of it. On our new theory she is speaking truth when she says so. There- fore, it was placed in her wardrobe. Who placed it there? Someone who wished to incriminate her. Was not that person the actual criminal? 
You see how we come at once upon a most fruitful line of inquiry.\" \nWe were compelled to spend the night at Winchester, as the formalities had not yet been completed, but next morning, in the company of Mr. Joyce Cummings, the rising barrister who was entrusted with the defence, we were allowed to see the young lady in her cell. I had expected from all that we had heard to see a beautiful woman, but I can never forget the effect which Miss Dunbar pro- duced upon me. It was no wonder that even the masterful millionaire had found in her something more powerful than himself \u2014 something which could control and guide him. One felt, too, as one looked at the strong, clear-cut, and yet sensitive face, that even should she be capable"}}

Then there are these log entries, which I think are just the consequence of some failure above:

​{"type": "error", "data": "Error executing verb \"cluster_graph\" in create_base_entity_graph: Columns must be same length as key", "stack": "Traceback (most recent call last):\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\datashaper\\workflow\\workflow.py\", line 410, in _execute_verb\n    result = node.verb.func(**verb_args)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\index\\verbs\\graph\\clustering\\cluster_graph.py\", line 102, in cluster_graph\n    output_df[[level_to, to]] = pd.DataFrame(\n    ~~~~~~~~~^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\pandas\\core\\frame.py\", line 4299, in __setitem__\n    self._setitem_array(key, value)\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\pandas\\core\\frame.py\", line 4341, in _setitem_array\n    check_key_length(self.columns, key, value)\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\pandas\\core\\indexers\\utils.py\", line 390, in check_key_length\n    raise ValueError(\"Columns must be same length as key\")\nValueError: Columns must be same length as key\n", "source": "Columns must be same length as key", "details": null}
{"type": "error", "data": "Error running pipeline!", "stack": "Traceback (most recent call last):\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\index\\run.py\", line 323, in run_pipeline\n    result = await workflow.run(context, callbacks)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\datashaper\\workflow\\workflow.py\", line 369, in run\n    timing = await self._execute_verb(node, context, callbacks)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\datashaper\\workflow\\workflow.py\", line 410, in _execute_verb\n    result = node.verb.func(**verb_args)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphrag\\index\\verbs\\graph\\clustering\\cluster_graph.py\", line 102, in cluster_graph\n    output_df[[level_to, to]] = pd.DataFrame(\n    ~~~~~~~~~^^^^^^^^^^^^^^^^\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\pandas\\core\\frame.py\", line 4299, in __setitem__\n    self._setitem_array(key, value)\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\pandas\\core\\frame.py\", line 4341, in _setitem_array\n    check_key_length(self.columns, key, value)\n  File \"C:\\Users\\steventruitt\\source\\repos\\graphrag\\graphragtest\\Lib\\site-packages\\pandas\\core\\indexers\\utils.py\", line 390, in check_key_length\n    raise ValueError(\"Columns must be same length as key\")\nValueError: Columns must be same length as key\n", "source": "Columns must be same length as key", "details": null}

The stats.json shows this, so I'm pretty confident that it fails on the merge_graphs entity graph creation:

{
    "total_runtime": 10167.02763581276,
    "num_documents": 1,
    "input_load_time": 0,
    "workflows": {
        "create_base_text_units": {
            "overall": 2.7237725257873535,
            "0_orderby": 0.0060007572174072266,
            "1_zip": 0.006998538970947266,
            "2_aggregate_override": 0.008006811141967773,
            "3_chunk": 2.4501953125,
            "4_select": 0.006997585296630859,
            "5_unroll": 0.01601433753967285,
            "6_rename": 0.008991479873657227,
            "7_genid": 0.11474370956420898,
            "8_unzip": 0.007005214691162109,
            "9_copy": 0.007086038589477539,
            "10_filter": 0.08634305000305176
        },
        "create_base_extracted_entities": {
            "overall": 10163.398600578308,
            "0_entity_extract": 10163.042577505112,
            "1_merge_graphs": 0.3440265655517578
        },
        "create_summarized_entities": {
            "overall": 0.0209963321685791,
            "0_summarize_descriptions": 0.012000083923339844
        }
    }
}

Symbol Zeros is already exposed as ().

I followed https://microsoft.github.io/graphrag/posts/get_started/, set OPENAI_KEY, and ran python -m graphrag.index --init --root ./ragtest, and got this error:

/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
  from pandas.core import (
2024-07-03 13:11:26.122415: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/runpy.py", line 146, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/graphrag/index/__init__.py", line 40, in <module>
    from .run import run_pipeline, run_pipeline_with_config
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/graphrag/index/run.py", line 59, in <module>
    from .verbs import *  # noqa
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/graphrag/index/verbs/__init__.py", line 9, in <module>
    from .graph import (
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/graphrag/index/verbs/graph/__init__.py", line 10, in <module>
    from .layout import layout_graph
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/graphrag/index/verbs/graph/layout/__init__.py", line 6, in <module>
    from .layout_graph import layout_graph
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/graphrag/index/verbs/graph/layout/layout_graph.py", line 13, in <module>
    from graphrag.index.graph.visualization import GraphLayout
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/graphrag/index/graph/visualization/__init__.py", line 6, in <module>
    from .compute_umap_positions import compute_umap_positions, get_zero_positions
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/graphrag/index/graph/visualization/compute_umap_positions.py", line 6, in <module>
    import graspologic as gc
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/graspologic/__init__.py", line 4, in <module>
    import graspologic.align
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/graspologic/align/__init__.py", line 5, in <module>
    from .seedless_procrustes import SeedlessProcrustes
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/graspologic/align/seedless_procrustes.py", line 7, in <module>
    import ot
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/ot/__init__.py", line 21, in <module>
    from . import lp
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/ot/lp/__init__.py", line 20, in <module>
    from .dmmot import dmmot_monge_1dgrid_loss, dmmot_monge_1dgrid_optimize
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/ot/lp/dmmot.py", line 12, in <module>
    from ..backend import get_backend
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/ot/backend.py", line 145, in <module>
    import tensorflow as tf
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/tensorflow/__init__.py", line 470, in <module>
    _keras._load()
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/tensorflow/python/util/lazy_loader.py", line 41, in _load
    module = importlib.import_module(self.__name__)
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/keras/__init__.py", line 21, in <module>
    from keras import models
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/keras/models/__init__.py", line 18, in <module>
    from keras.engine.functional import Functional
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/keras/engine/functional.py", line 26, in <module>
    from keras import backend
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/keras/backend/__init__.py", line 3, in <module>
    from keras.backend import experimental
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/keras/backend/experimental/__init__.py", line 3, in <module>
    from keras.src.backend import disable_tf_random_generator
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/keras/src/__init__.py", line 21, in <module>
    from keras.src import applications
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/keras/src/applications/__init__.py", line 18, in <module>
    from keras.src.applications.convnext import ConvNeXtBase
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/keras/src/applications/convnext.py", line 28, in <module>
    from keras.src import backend
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/keras/src/backend.py", line 35, in <module>
    from keras.src.engine import keras_tensor
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/keras/src/engine/keras_tensor.py", line 19, in <module>
    from keras.src.utils import object_identity
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/keras/src/utils/__init__.py", line 20, in <module>
    from keras.src.saving.serialization_lib import deserialize_keras_object
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/keras/src/saving/serialization_lib.py", line 28, in <module>
    from keras.src.saving.legacy.saved_model.utils import in_tf_saved_model_scope
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/keras/src/saving/legacy/saved_model/utils.py", line 30, in <module>
    from keras.src.utils.layer_utils import CallFunctionSpec
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/keras/src/utils/layer_utils.py", line 26, in <module>
    from keras.src import initializers
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/keras/src/initializers/__init__.py", line 23, in <module>
    from keras.src.initializers import initializers_v1
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/keras/src/initializers/initializers_v1.py", line 32, in <module>
    keras_export(v1=["keras.initializers.Zeros", "keras.initializers.zeros"])(
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/tensorflow/python/util/tf_export.py", line 348, in __call__
    self.set_attr(undecorated_func, api_names_attr, self._names)
  File "/Users/ankushsingal/miniconda3/envs/snakes/lib/python3.10/site-packages/tensorflow/python/util/tf_export.py", line 363, in set_attr
    raise SymbolAlreadyExposedError(
tensorflow.python.util.tf_export.SymbolAlreadyExposedError: Symbol Zeros is already exposed as ().
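The failing import chain is graphrag → graspologic → POT (ot) → tensorflow → keras, so the SymbolAlreadyExposedError points to a tensorflow/keras version mismatch in the environment rather than a graphrag bug. A minimal diagnostic sketch, assuming the conflict is between a standalone keras install and the tensorflow build (package names below are the usual PyPI ones):

import importlib.metadata as md

# Print versions of everything on the failing import chain.
# SymbolAlreadyExposedError is typically raised when a standalone
# `keras` package does not match the installed tensorflow version.
for pkg in ("graphrag", "graspologic", "POT", "tensorflow", "keras"):
    try:
        print(f"{pkg}: {md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")

If the tensorflow and keras versions disagree, aligning them (or removing the standalone keras so tensorflow falls back to its bundled copy) is the usual remedy.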

graphrag.prompt_tune throws error while parsing the CSV input files

I am creating prompt examples using the prompt_tune module. I have a single CSV file containing 50 rows. I am running the command:
python -m graphrag.prompt_tune --root <root_dir_name> --method random --limit 10 --no-entity-types --output <output_dir_name>

I have set all the required env variables. The error is:

  File "graphrag/index/input/csv.py", line 38, in load_file
    data = pd.read_csv(buffer, encoding=config.encoding or "latin-1")
                                        ^^^^^^^^^^^^^^^
AttributeError: 'InputConfig' object has no attribute 'encoding'

I suppose this should be config.file_encoding instead of config.encoding (as per create_graphrag_config.py), although I'm not sure. Please help me resolve the error. Thanks!
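If the reporter's reading of create_graphrag_config.py is right, the fix is a one-line attribute rename in the loader. A minimal sketch of the suspected change in graphrag/index/input/csv.py (file_encoding is the reporter's suggestion, unverified against the installed version):

# Before: raises AttributeError because InputConfig defines no `encoding`.
#     data = pd.read_csv(buffer, encoding=config.encoding or "latin-1")
# After: read the attribute the config model actually defines.
data = pd.read_csv(buffer, encoding=config.file_encoding or "latin-1")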

Expand CSV Parsing Options

Our CSV parsing tools support a rich set of input parsing options; we should expose some of these in our CSV input configuration model.
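A minimal sketch of what the extended model could surface, assuming a Pydantic-style config (the field names below are illustrative pandas.read_csv options, not the project's actual schema):

from pydantic import BaseModel

class CsvInputConfig(BaseModel):
    # Existing-style field (illustrative name).
    file_encoding: str = "utf-8"
    # Candidate pandas.read_csv options to expose in configuration.
    delimiter: str = ","
    quotechar: str = '"'
    header: int | None = 0
    skip_blank_lines: bool = True

Each field maps one-to-one onto a pandas.read_csv keyword, so the loader can pass them straight through as keyword arguments.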

Warning: 'DataFrame.swapaxes' is deprecated

/XXX/graphrag/.venv/lib/python3.10/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
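This warning typically fires when a DataFrame is handed to a NumPy routine such as np.array_split, which calls the deprecated DataFrame.swapaxes internally. A minimal sketch of a warning-free alternative, assuming the call site is splitting a DataFrame into batches (the chunk size is illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(10)})

# Deprecated path: np.array_split(df, 3) goes through DataFrame.swapaxes
# and emits the FutureWarning shown above.

# Warning-free alternative: slice on the positional index instead.
size = 4
chunks = [df.iloc[i : i + size] for i in range(0, len(df), size)]
print([len(c) for c in chunks])  # [4, 4, 2]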

Provide a control flag to only return context that is cited in the response

As part of pulling context for response synthesis, a lot of information is retrieved that is ultimately not used in the response. In some cases this is helpful, while in others it is distracting. An option to toggle a more limited mode (e.g. "return_full_context=True/False" in the context-builder params) would provide a lot of clarity for some applications and reduce the boilerplate code needed to scan for and filter out unused context.

Example code snippet:

context_builder_params = {
    "use_community_summary": False,  # False = full community reports; True = short community summaries
    "shuffle_data": True,
    "include_community_rank": True,
    "min_community_rank": 0,
    "max_tokens": 12_000,  # adjust to your model's token limit (e.g. ~5000 for an 8k-context model)
    "context_name": "Reports",
    "return_full_context": True,
}
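Until such a flag exists, callers can approximate the limited mode by post-filtering. A minimal sketch, assuming the response cites records in a "[Reports (1, 3)]" style and each context record carries an "id" field (both are assumptions about the data shapes, not the library's documented contract):

import re

def cited_context(response_text, context_records):
    # Collect record ids the response actually cites, e.g. "[Reports (3, 7)]".
    cited_ids = set()
    for group in re.findall(r"\[Reports \(([\d,\s]+)\)\]", response_text):
        cited_ids.update(int(i) for i in group.split(","))
    # Keep only the cited records.
    return [r for r in context_records if r["id"] in cited_ids]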

Unknown workflow: entity_extraction when running graphrag\examples\entity_extraction\with_graph_intelligence\run.py

Hi, sorry if I have overlooked a simple solution to this, but I am running into the following error when trying to run the example scripts for entity extraction with graph intelligence.

Example file: graphrag\examples\entity_extraction\with_graph_intelligence\run.py

Exception has occurred: UnknownWorkflowError
Unknown workflow: entity_extraction
  File "C:\Users\user\dev\MSGraphRag\graphrag\examples\entity_extraction\with_graph_intelligence\run.py", line 95, in run_python
    async for table in run_pipeline(dataset=dataset, workflows=workflows):
  File "C:\Users\user\dev\MSGraphRag\graphrag\examples\entity_extraction\with_graph_intelligence\run.py", line 107, in <module>
    asyncio.run(run_python())
graphrag.index.errors.UnknownWorkflowError: Unknown workflow: entity_extraction

It looks like the 'entity_extraction' workflow is not registered in the default_workflows.py file, nor present in the workflows/v1 folder.
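A quick diagnostic sketch to confirm which workflow names the installed build registers, assuming default_workflows.py exposes a name-to-workflow mapping as the comment above suggests (the default_workflows attribute name is an assumption; inspect the module if the import fails):

# Hypothetical import path/attribute; check your installed package layout.
from graphrag.index.workflows.default_workflows import default_workflows

print(sorted(default_workflows.keys()))

If 'entity_extraction' is absent from the printed list, the example script is out of sync with the installed package and needs to target one of the names that actually appears.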

PII Redaction Phase

This would be implemented after Entity Resolution, generating pseudonyms for PERSON/ORG-type entities.
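A minimal sketch of the idea, assuming resolved entities arrive as (name, type) pairs (the type labels and numbering scheme are illustrative, not a committed design):

import itertools

def pseudonymize(entities, pii_types=("PERSON", "ORG")):
    # Deterministic per-run mapping: the same entity always gets the same pseudonym.
    counters = {t: itertools.count(1) for t in pii_types}
    mapping = {}
    for name, etype in entities:
        if etype in pii_types and name not in mapping:
            mapping[name] = f"{etype}_{next(counters[etype])}"
    return mapping

entities = [("Ada Lovelace", "PERSON"), ("Contoso", "ORG"), ("Paris", "GEO")]
print(pseudonymize(entities))  # {'Ada Lovelace': 'PERSON_1', 'Contoso': 'ORG_1'}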

Documentation request - add links to sample CSV / step to convert text to input CSV

Apologies if this is just my lack of understanding, but going through the getting-started tutorial, there seems to be a step missing.

We download a book from Project Gutenberg in text format, and then we start the indexer.
However, the indexer expects CSV files in the input folder, and we only have the book's .txt file.

I checked the dulce.csv file that is also in the repo to transform my input into something acceptable, but I think any of the following would help people who are starting out:

  • an example CSV
  • a step to convert the .txt file into CSV (a sketch of one follows below)
  • a short description of the expected CSV format
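For the conversion step, a minimal sketch using pandas, assuming a text column named "text" is expected (the column name, paths, and paragraph-level rows are assumptions; check your input configuration):

import pandas as pd

# Read the downloaded Gutenberg book.
with open("ragtest/input/book.txt", encoding="utf-8") as f:
    text = f.read()

# One row per paragraph keeps rows a manageable size; the indexer
# performs its own chunking afterwards.
rows = [p.strip() for p in text.split("\n\n") if p.strip()]
pd.DataFrame({"text": rows}).to_csv("ragtest/input/book.csv", index=False)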

cursor disappears on completion of execution

system: ubuntu 22.04

Whenever a graphrag.* module is executed, the terminal cursor is turned off and not restored when the run completes. Annoying.
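This usually means a progress renderer hides the cursor with an ANSI escape sequence and skips the matching restore on exit. A minimal workaround sketch (the escape code is standard VT100 behaviour, not graphrag-specific):

import sys

# "\x1b[?25h" is the standard ANSI sequence to show the cursor again;
# the shell equivalent is `tput cnorm`.
sys.stdout.write("\x1b[?25h")
sys.stdout.flush()

The proper fix would be for the CLI to emit the restore sequence in a finally block (or atexit handler) around its progress rendering.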


Configuration token-replacement not working with non-str Pydantic fields

For example, a YAML configuration like this will break Pydantic model validation, because the unexpanded ${...} token leaves a string where the model expects a number:

chunks:
    overlap: ${GRAPHRAG_CHUNK_OVERLAP}

We should add a default_config_parameters_from_dict function that accepts a dict from YAML or JSON and then:

  • performs token replacements in the dictionary
  • converts the dictionary values into the expected model types
  • calls the existing default_config_parameters

(A sketch of this follows below.)
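A minimal sketch of such a function, assuming environment-variable tokens of the form ${VAR} and that Pydantic's own validation handles the string-to-number coercion once tokens are resolved (helper names other than default_config_parameters are illustrative):

import os
import re

_TOKEN = re.compile(r"\$\{(\w+)\}")

def _replace_tokens(value):
    # Recursively substitute ${VAR} tokens from the environment.
    if isinstance(value, dict):
        return {k: _replace_tokens(v) for k, v in value.items()}
    if isinstance(value, list):
        return [_replace_tokens(v) for v in value]
    if isinstance(value, str):
        return _TOKEN.sub(lambda m: os.environ.get(m.group(1), m.group(0)), value)
    return value

def default_config_parameters_from_dict(raw: dict):
    resolved = _replace_tokens(raw)
    # Pydantic coerces e.g. "200" -> 200 during validation, so delegating
    # to the existing entry point covers the type-conversion step.
    return default_config_parameters(resolved)  # existing function, per this issue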

Handle `summarize` context overflows

A partner reported that the summarize step would overflow the model's context window in their runs. We should take a look at the summarize step and possibly implement a multi-stage summarization process.
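A minimal sketch of the multi-stage idea, assuming a summarize_llm(text) -> str callable and a count_tokens helper (both stand-ins for whatever the pipeline actually uses):

def summarize_hierarchical(texts, summarize_llm, count_tokens, max_tokens=4000):
    # Stage 1: greedily pack the inputs into batches that fit the window.
    batches, current = [], []
    for t in texts:
        if current and count_tokens("\n".join(current + [t])) > max_tokens:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)

    # Stage 2: summarize each batch, then recurse over the partial
    # summaries until everything fits into a single call.
    partials = [summarize_llm("\n".join(b)) for b in batches]
    if len(partials) == 1:
        return partials[0]
    return summarize_hierarchical(partials, summarize_llm, count_tokens, max_tokens)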

Unified Docsite with Narrative & API Documentation

Our current docsite is focused on the narrative "why" and on using the CLI entry points. We should also have proper API documentation. To achieve this, we should move the docsite to Sphinx (or another solution) and properly document the surface area of our primary APIs.

Prompt Gym

It could be useful to create a "prompt gym" where prompts + models + parameters can be tuned and compared against our best-performing runs. We'll need some good labeled data and the ability to measure LLM output quality.
