
mindgpt's Introduction

MindGPT 🧠


MindGPT is a conversational system where users can ask mental health-orientated questions and receive answers that summarise content from two leading mental health websites: Mind and NHS Mental Health.

It's not a digital counsellor or therapist, and the output from the system should not be treated as such; MindGPT is purely focused on acting as a gateway to vital information sources, summarising two authoritative perspectives and providing pointers to the original content.

In building this, we've drawn on our expertise in MLOps and prior experience in fine-tuning open-source LLMs for various tasks (see here for an example of one tuned to summarise legal text). If you're interested in how MindGPT works under the hood and what technologies and data we've used, then take a look here.

โ‰ Why?

Mental health problems are something that everyone struggles with at various points in their lives, and finding the right type of help and information can be hard, even a blocker. Mind, one of the main mental health charities in the UK, puts this best: when you're living with a mental health problem, or supporting someone who is, having access to the right information is vital.

MindGPT sets out to increase ease of access to this information.

👀 Follow along

This project is in active development at Fuzzy Labs and you can follow along!

This repository is one way to monitor progress: we're always raising pull requests to move the project forward. Another is to follow our blog, where we'll be posting periodic updates as we tick off sections of the project. If you're new to the project, the best place to start is our project introduction.

๐Ÿƒ How do I get started?

Data Scraping pipeline

Before running the data scraping pipeline, some additional setup is required to create a storage container on Azure to store the data. We use the DVC tool for data versioning and data management. The data version documentation guides you through the process of setting up a storage container on Azure and configuring DVC.

In this pipeline, there are two steps:

  • Scrape data from Mind and NHS Mental Health websites
  • Store and version the scraped data in a storage container on Azure using DVC
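For reference, pointing DVC at an Azure Blob Storage container typically looks something like the sketch below. This is only an illustration, not the exact commands from the documentation; the remote name, container name, and storage account here are placeholders.

# Hypothetical remote/container/account names -- follow the data version docs for the real values
dvc remote add -d azure-remote azure://<container-name>/data
dvc remote modify azure-remote account_name <storage-account-name>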

Now that you're all set up, let's run the data scraping pipeline.

python run.py --scrape

Data Preparation pipeline

Now that we have data scraped, we're ready to prepare that data for the model. We've created a separate pipeline for this, where we clean, validate, and version the data.

We run the data preparation pipeline using the following command.

python run.py --prepare

Data Embedding pipeline

To run the embedding pipeline, Azure Kubernetes Service (AKS) needs to be provisioned. We use AKS to run the Chroma (vector database) service.

The matcha tool can help you provision these resources. Install the matcha-ml library and provision resources using the matcha provision command.

pip install matcha-ml
matcha provision

After provisioning completes, we will have the following resources on hand:

  • Kubernetes cluster on Azure
  • Seldon Core installed on this cluster
  • Istio ingress installed on this cluster

Chroma

Next, we apply the Kubernetes manifests to deploy the Chroma server on AKS using the following command:

kubectl apply -f infrastructure/chroma_server_k8s

Port-forward the Chroma server service to localhost using the following command, so that we can access the server from localhost.

kubectl port-forward service/chroma-service 8000:8000
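As a quick sanity check that the server is reachable, you can hit Chroma's heartbeat endpoint. This assumes the REST API exposed by the Chroma version used in this project; the exact path may differ between Chroma releases.

# Should return a JSON heartbeat (a nanosecond timestamp) if the server is up
curl http://localhost:8000/api/v1/heartbeat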

Monitoring

To run the monitoring service on Kubernetes, matcha provision must have been run beforehand. We need to build and push the metric service image to the Azure Container Registry (ACR); this image will be used by the Kubernetes deployment. Before that, we need to set two bash variables, one for the ACR registry URI and another for the ACR registry name, using the matcha get command.

acr_registry_uri=$(matcha get container-registry registry-url --output json | sed -n 's/.*"registry-url": "\(.*\)".*/\1/p')
acr_registry_name=$(matcha get container-registry registry-name --output json | sed -n 's/.*"registry-name": "\(.*\)".*/\1/p')

Now we're ready to log in to ACR, then build and push the image:

az acr login --name $acr_registry_name
docker build -t $acr_registry_uri/monitoring:latest -f monitoring/metric_service/Dockerfile .
docker push $acr_registry_uri/monitoring:latest

Line 39 in monitoring-deployment.yaml should be updated to match the Docker image name we've just pushed to ACR; it needs to be in the following format: <name-of-acr-registry>.azurecr.io/monitoring.
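If you'd rather not edit the manifest by hand, a sed one-liner along these lines should work. It assumes the deployment sets the image via a standard image: field and that the file lives at infrastructure/monitoring/monitoring-deployment.yaml; adjust if the layout differs.

# Rewrites every 'image:' line in the manifest -- check the result if the file defines more than one container
sed -i "s|image: .*|image: ${acr_registry_uri}/monitoring:latest|" infrastructure/monitoring/monitoring-deployment.yaml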

Next, we apply the Kubernetes manifest to deploy the metric service and the metric database on AKS.

kubectl apply -f infrastructure/monitoring

Once kubectl has finished applying the manifest, we should verify that the monitoring service is running. Running the commands below will give you an IP address for the service, which we can then curl for a response:

kubectl get pods # Checking whether the monitoring pod is running

# Expected output (Note the name of pod will be different)
NAME                                    READY   STATUS    RESTARTS   AGE
monitoring-service-588f644c49-ldjhf     2/2     Running   0          3d1h
kubectl get svc monitoring-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

We should be able to curl the external IP returned by the above command on port 5000.

curl <external-ip>:5000

# The response should be:
Hello world from the metric service.

Now that we know everything is up and running, we need to use port-forwarding so the embedding pipeline (which runs locally) can communicate with the service hosted on Kubernetes:

kubectl port-forward service/monitoring-service 5000:5000

In the data embedding pipeline, we take the validated dataset from the data preparation pipeline and use the Chroma vector database to store the embeddings of the text data. This pipeline uses both the Mind and NHS data.

Finally, in a separate terminal we can run the data embedding pipeline.

python run.py --embed

Note: this pipeline might take somewhere between 5 and 10 minutes to run.

Provision pre-trained LLM

Preparation

To deploy a pre-trained LLM, we first need a Kubernetes cluster with Seldon Core. The matcha tool can help you provision the required resources; see the section above on how to set this up.

Deploy model

Apply the prepared Kubernetes manifest to deploy the model:

kubectl apply -f infrastructure/llm_k8s/seldon-deployment.yaml

This will create a Seldon deployment, which consists of:

  • A pod that loads the pipeline and the model for inference
  • A service and ingress routing, for accessing it from outside the cluster
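Before querying the model, it's worth checking that the deployment has rolled out. A sketch of how to do that is below; it assumes the deployment lives in the matcha-seldon-workloads namespace implied by the inference URL further down (sdep is the short name Seldon Core registers for SeldonDeployment resources).

# Assumed namespace taken from the inference URL below
kubectl get sdep -n matcha-seldon-workloads
kubectl get pods -n matcha-seldon-workloads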

Query model

You can get the ingress IP with matcha:

matcha get model-deployer base-url

The full URL to query the model is:

http://<INGRESS_IP>/seldon/matcha-seldon-workloads/llm/v2/models/transformer/infer

The expected payload structure is as follows:

{
    "inputs": [
        {
            "name": "array_inputs",
            "shape": [-1],
            "datatype": "string",
            "data": "Some prompt text"
        }
    ]
}
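Putting the URL and payload together, a query from the command line might look like the following sketch. The ingress IP placeholder is whatever matcha get model-deployer base-url returned, and the prompt text is just an example.

curl -X POST "http://<INGRESS_IP>/seldon/matcha-seldon-workloads/llm/v2/models/transformer/infer" \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"name": "array_inputs", "shape": [-1], "datatype": "string", "data": "What is anxiety?"}]}'

The response follows the V2 inference protocol, so the model's answer comes back in the outputs field.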

Monitoring

Running locally

To run the monitoring service on your local machine, we'll use docker-compose. This will initialise two services: the metric service interface, which listens for POST and GET requests, and the metric database service.

To run docker-compose:

docker-compose -f monitoring/docker-compose.yml up

Once the two containers have started, we can curl our metric service from the outside.

curl localhost:5000/
# This should return a default message saying "Hello world from the metric service."

curl -X POST localhost:5000/readability -H "Content-Type: application/json" -d '{"response": "test_response"}'
# This should compute a readability score and insert the score into the "Readability" relation. We should also expect the following response message:
"{"message":"Readability data has been successfully inserted.","score":36.62,"status_code":200}

curl -X POST localhost:5000/embedding_drift -H "Content-Type: application/json" -d '{"reference_dataset": "1.1", "current_dataset": "1.2", "distance": 0.1, "drifted": true}'
# This should insert the embedding drift data into our "EmbeddingDrift" relation and return a success message. If a required field such as reference_dataset were missing, we would instead see a validation error like:
{"message":"Validation error: 'reference_dataset is not found in the data dictionary.'","status_code":400}

# We can also query our database with:
curl localhost:5000/query_readability

Monitoring MindGPT 👀

We've created a notebook which accesses the monitoring service, fetches the metrics, and creates some simple plots showing the change over time.

This is a starting point for accessing the metrics, and we're planning to introduce a hosted dashboard version of these plots at some point in the future.

Streamlit Application

To deploy the Streamlit application on AKS, we first need to build a Docker image and then push it to ACR.

Note: Run the following command from the root of the project.

Verify that you are in the root of the project.

pwd

/home/username/MindGPT

We build and push the Streamlit application image to ACR; this image will be used by the Kubernetes deployment. Before that, we need to set two bash variables, one for the ACR registry URI and another for the ACR registry name, using the matcha get command.

acr_registry_uri=$(matcha get container-registry registry-url --output json | sed -n 's/.*"registry-url": "\(.*\)".*/\1/p')
acr_registry_name=$(matcha get container-registry registry-name --output json | sed -n 's/.*"registry-name": "\(.*\)".*/\1/p')

Now we're ready to log in to ACR, then build and push the image:

az acr login --name $acr_registry_name
docker build -t $acr_registry_uri/mindgpt:latest -f app/Dockerfile .
docker push $acr_registry_uri/mindgpt:latest

Line 19 in streamlit-deployment.yaml should be updated to match the Docker image name we've just pushed to ACR; it needs to be in the following format: <name-of-acr-registry>.azurecr.io/mindgpt.

Next, we apply the Kubernetes manifest to deploy the Streamlit application on AKS.

kubectl apply -f infrastructure/streamlit_k8s

Finally, we verify that the Streamlit application is running. The command below should provide an IP address for it.

kubectl get service streamlit-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

If you visit that IP address in a browser, you should be able to interact with the deployed Streamlit application.

๐Ÿค Acknowledgements

This project wouldn't be possible without the exceptional content on both the Mind and NHS Mental Health websites.

📜 License

This project is released under the Apache License. See LICENSE.

mindgpt's People

Contributors

d-lowl, dudeperf3ct, osw282, swells2020, christopher-norman, kirsoppj, jonocx


mindgpt's Issues

Pre-commit versions should be updated

Description of the problem

As mentioned in MindGPT-72, several of the pre-commit hooks could do with updating, as they all appear to be running on versions from around March 2023 (or July 2020 in the case of pre-commit-hooks).

Details

Repo              MindGPT   Latest
pre-commit-hooks  3.2.0     4.4.0
black             22.10.0   23.7.0
ruff-pre-commit   0.0.257   0.0.287
mirrors-mypy      1.1.1     1.5.1
typos             1.16.1    1.16.11

You might not want to be right at the bleeding edge for some of the packages (say, ruff-pre-commit, which is still fairly new), but it would be worth updating in general, I think.

Potential approach 1

  1. Run the builtin autoupdate in pre-commit:
pre-commit autoupdate

This will automatically update to the most recent revision on the primary branch:

[https://github.com/pre-commit/pre-commit-hooks] updating v3.2.0 -> v4.4.0
[https://github.com/psf/black] updating 22.10.0 -> 23.7.0
[https://github.com/charliermarsh/ruff-pre-commit] updating v0.0.257 -> v0.0.287
[https://github.com/pre-commit/mirrors-mypy] updating v1.1.1 -> v1.5.1
[https://github.com/crate-ci/typos] updating v1.16.1 -> v1.16.11
  2. Run pre-commit against all files in the repo to check the up-versioning has not resulted in any failures

Spoilers: it will, I'm afraid.

E.g. black has changed some of its prioritisation for long-line wrapping in function names.

The good news is that most of the failures will be autofixed with a single run (from a quick look, around 90% of the issues are auto-resolved).

  3. Update the dev dependencies (manually, unfortunately)

Unfortunately, I don't think there is a nice way to get poetry to automatically pick up the updated pre-commit hook versions, so you would have to manually uplift the dependencies in pyproject.toml.

  4. Update the lock file
poetry update

Optionally you could just update the dev dependencies with:

poetry update --only dev

Potential approach 2

I haven't really investigated this route much, but it would be worth a further look.

The sync_with_poetry hook flips approach 1 and takes the rev generated by poetry in the lock file as the package rev to use. It then pushes that value into the .pre-commit-config.yaml file, ensuring they remain in sync with poetry.

Documentation

Once decided upon, it would be worth including some developer documentation on how to keep the packages up to date across poetry and pre-commit.

An aside

It is also worth noting that you do not have some of your pre-commit dependencies included in the [tool.poetry.group.dev.dependencies] block. Namely: black and ruff.

I'm not sure whether this is intentional on your part. You get them as secondary dependencies through several other packages, but you might want to look at including them explicitly.

Proposed inclusion of Poetry pre-commit hooks

Description of the proposal

Combining poetry and pre-commit can be very powerful. Currently both are being used somewhat in isolation from each other.

There are a collection of poetry specific pre-commit hooks that can be used to ensure poetry configuration is correct before a commit.

Details

More information on the poetry hooks can be found here.

I would suggest using:

  1. poetry-check - checks the config is valid
  2. poetry-lock - runs poetry lock ahead of a commit to ensure the lock file is up to date*

*You may only want to run the poetry-lock hook on merge checks, as I would have thought it would be quite rare for the requirements to change. This might be easier to implement as a check on git push to the develop branch.

Example

It would look something like the following:

repos:
-   repo: https://github.com/python-poetry/poetry
    rev: '1.6.1'
    hooks:
    -   id: poetry-check
    -   id: poetry-lock
    
    ...

Impact

There will be a slight time increase due to any additions, but it should be pretty negligible compared to, say, mypy, which is painfully slow at times.

Pre-commit failing when run against all files

Hey!

(Can't see an issue template so apologies if this isn't the format you would prefer).

When experimenting locally I noticed that some of the pre-commit hooks are failing when run against all files.
Nothing major at all, just thought I would flag them so you get greens across the board on your checks.

Environment

  • OS: Ubuntu 20.04
  • Python version: 3.10.5 (through pyenv as per developer setup)

Description of the problem

Running pre-commit (as per command below) against the develop branch, across all files, results in failures.

Command:

pre-commit run --all-files

Failures are for the typos and mypy hooks.

typos
The failures here appear to be a simple config issue, which is being tracked here.

A temporary solution is to add pass_filenames: false to the hook.

mypy
The failure here appears to be due to some historic type: ignore comments on the requests_html lib. I assume they have added typing (very nice of them) at some point since you started using pre-commit.

Solution

Happy to put in PR with these small changes.

(I also noticed that some of the pre-commit hooks could do with updating; I will put that in as a separate issue and PR to keep things clean.)

ChromaDB Test Failures

Summary

Several of the ChromaDB related tests are failing.

Environment

  • OS: Ubuntu 20.04
  • Python version: 3.10.5 (through pyenv as per developer setup)

Running the Tests

On the develop branch I followed the setup instructions in DEVELOPMENT.md.

To run the test suite I tried two approaches:

  1. Run tests from within the poetry shell:
poetry shell # explicitly enter the poetry shell
python -m pytest tests --cov

  2. Run tests from outside the poetry shell:
poetry run python -m pytest tests --cov

Both resulted in failures for the functions in test_chroma_store.py.

Test Failure Details

test_get_or_create_collection
>       assert {"test", "test1"} == set(store.list_collection_names())
E       AssertionError: assert {'test', 'test1'} == {'test', 'test1', 'test_new'}
E       Extra items in the right set:
E       'test_new'
E       Full diff:
E       - {'test', 'test_new', 'test1'}
E       + {'test', 'test1'}

test_add_texts
>       assert data["documents"] == input_texts
E       AssertionError: assert ['foo', 'dummy', 'a', 'b'] == ['a', 'b']
E       At index 0 diff: 'foo' != 'a'
E       Left contains 2 more items, first extra item: 'a'
E       Full diff:
E       - ['a', 'b']
E       + ['foo', 'dummy', 'a', 'b']

test_list_collection_names
>       assert {"test", "test1"} == set(store.list_collection_names())
E       AssertionError: assert {'test', 'test1'} == {'test', 'test1', 'test_new'}
E       Extra items in the right set:
E       'test_new'
E       Full diff:
E       - {'test', 'test_new', 'test1'}
E       + {'test', 'test1'}

So far as I can tell, these issues seem to arise from the persistence of the DB's state across tests, due to store._client = local_persist_api here, for example.

Because the Chroma store persists, operations carried out in one test impact the store's state in other tests, resulting in the above errors.

Potential Solution / Discussion

Whilst persisting state is an important feature to test, I would suggest that for initial unit tests you probably do not want this functionality, as it couples your tests, resulting in different outcomes depending on the order of execution.

I would propose splitting this module into two sets of tests:

  1. Isolated, singular unit tests that are independent, can be run in any order, and have a "clean" DB to interact with at the start of each test.
  2. Higher-level, integration-style tests that check how multiple, consecutive operations behave, i.e. where the order of operations is considered.

(1) Isolated, independent unit-tests

For the unit-test suite (1) you could implement a "tear-down" function (i.e. a pytest fixture) which would run after any functions that interact with / modify the database. This function could reset the state of the db to ensure consistent behaviour.

Alternatively, you could have a "spin-up" function (again a pytest fixture) which would spin up a new DB for each function. This would have a resource overhead attached, but it would ensure consistent behaviour and keep the tests independent.

(2) Integration tests

Integration might not quite be the right term here but the key part would be that they operate at a higher level, executing the functionality of several unit operations.

These would be more in line with actual system usage, i.e. create a DB, update a record, delete a record, and finally fetch a record. They wouldn't be exhaustive, but they could cover some of the expected usage and would allow you to test common collections of operations.
