Delta-Buddy

Introducing Delta-Buddy: Your ultimate Delta Lake companion! 🐍 Streamline your data journey with an AI-powered chatbot. Ask Delta-Buddy anything about your Delta Lake.

Demo

delta-buddy-demo.mp4

⚡️ Features

  • A chatbot that lets you ask Dolly questions about your documents and datasets.
  • Ingest documents into a Chroma database locally or from a Databricks notebook.
  • Provide a web UI based on Chainlit to ask questions and receive answers.
  • Provide an API based on FastAPI to receive and answer questions from anywhere.
  • Run locally or on Databricks while keeping your data safe.
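
The question-answering flow behind these features can be pictured as retrieve-then-prompt: the documents closest to the question are selected and handed to the model as context. The following is only a minimal sketch of that idea, assuming toy hand-written vectors in place of real embeddings. The helper names (`top_k_sources`, `build_prompt`) are hypothetical and it does not use the actual Chroma or Dolly APIs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_sources(question_vector, documents, k):
    """Return the k document texts whose vectors are most similar to the question.

    `documents` is a list of (text, vector) pairs. In a real setup the vectors
    would come from an embeddings model and be stored in a vector database.
    """
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(question_vector, doc[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(question, sources):
    """Assemble a prompt that places the retrieved sources before the question."""
    context = "\n".join(f"- {source}" for source in sources)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy data: hand-written 3-dimensional vectors stand in for real embeddings.
DOCUMENTS = [
    ("Delta Lake supports ACID transactions.", [0.9, 0.1, 0.0]),
    ("Chainlit provides a chat UI.", [0.0, 0.2, 0.9]),
    ("Delta Lake keeps a transaction log.", [0.8, 0.3, 0.1]),
]

if __name__ == "__main__":
    question_vector = [1.0, 0.2, 0.0]
    sources = top_k_sources(question_vector, DOCUMENTS, k=2)
    print(build_prompt("How does Delta Lake handle transactions?", sources))
```

Here `k` plays the role that SOURCE_DOCUMENTS_MAX_COUNT appears to play in the configuration: it caps how many sources are placed in the prompt.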

📚 Documentation

The documentation for Delta-Buddy is under construction.

📦 Installation

Locally

  • Configure the env.sample file in the root and rename it to .env with your configuration (use local for the execution context).
  • Install the virtual environment and the dependencies:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
  • Optionally, install the development dependencies if you want to contribute to Delta-Buddy:
pip install -r requirements-dev.txt
  • Run the data preparation example (data_preparation.prepare_delta_buddy.py) with the following command:
make prepare-data
  • Delta-Buddy is now ready. Launch the UI with the following command:
make launch-ui
  • Once everything is running, use the UI to ask Delta-Buddy questions.

Disclaimer: the first run may take some time because the LLM has to be downloaded.

On Databricks

  • Add the delta-buddy repo to Databricks: under Repos, click Add Repo, enter https://github.com/fvaleye/delta-buddy.git, then click Create Repo. More information here.

  • Start a 13.0 LTS ML (includes Apache Spark 3.4.0, GPU, Scala 2.12) single-node cluster with a node type that has 8 A100 GPUs (e.g. Standard_ND96asr_v4 or p4d.24xlarge). These instance types may not be available in all regions and may be difficult to provision. In Databricks, you must select the GPU runtime first and unselect "Use Photon" for these instance types to appear (where supported).

  • Configure the env.sample file in the root, rename it to .env, and set the execution context to databricks.

  • Configure the env.sample file in the notebooks folder and rename it to .env for your notebook configuration.

  • Open the notebooks folder inside the Repo, attach the notebooks (delta_buddy_preparation.py first, then delta_buddy_run.py) to your GPU cluster, and run all cells.

  • Open the notebooks folder and launch the notebook delta_buddy_preparation.py to prepare the Chroma database on your Databricks cluster.

  • Open the notebooks folder and launch the notebook delta_buddy_run.py to test the chatbot on your Databricks cluster.

  • delta_buddy_run.py supports several serving modes: local, notebook API, or LLM connection (see the environment variables to choose the serving mode that fits best).

  • Once everything is running, you are ready to ask Delta-Buddy questions.

  • Launch the UI connected to Databricks depending on the serving mode with the following command:

make launch-ui

Disclaimer: the first run may take some time because the LLM has to be downloaded.
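
The execution context and the serving mode chosen in the steps above have to agree with each other. The following is a minimal sketch of validating that pair, assuming the value names listed under Environment Variables (`local`, `databricks`, `notebook_hosted_api`, `notebook_api`). The `resolve_serving` helper and the rule that a local context always implies local serving are illustrative assumptions, not Delta-Buddy's actual code.

```python
from enum import Enum


class ExecutionContext(str, Enum):
    LOCAL = "local"
    DATABRICKS = "databricks"


class ServingMode(str, Enum):
    LOCAL = "local"
    NOTEBOOK_HOSTED_API = "notebook_hosted_api"
    NOTEBOOK_API = "notebook_api"


def resolve_serving(execution_context, serving_mode=None):
    """Validate the two settings and return them as enum members.

    Assumption: the serving mode only matters when the execution context is
    "databricks"; for a local context it is treated as "local".
    Unknown values raise ValueError.
    """
    context = ExecutionContext(execution_context.strip().lower())
    if context is ExecutionContext.LOCAL:
        return context, ServingMode.LOCAL
    if not serving_mode:
        raise ValueError(
            "DATABRICKS_SERVING_MODE is required for the databricks context"
        )
    return context, ServingMode(serving_mode.strip().lower())


if __name__ == "__main__":
    print(resolve_serving("databricks", "notebook_api"))
    print(resolve_serving("local"))
```

Failing early on an unknown value keeps a typo in the .env file from surfacing later as a confusing connection error.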

🔎 Usage

Under construction.

🔒 Privacy & Security

Delta-Buddy is designed to run locally or on Databricks with Dolly, so your data is not shared with anyone.

⚙️ Environment Variables

The environment variables can be set from an environment file .env.

| Parameter | Description |
|---|---|
| EXECUTION_CONTEXT | The execution context for running Delta-Buddy: local or databricks (see DATABRICKS_SERVING_MODE to specify the serving access). |
| PREPARATION_MODEL_NAME | The model used for preparation and execution for the sentence transformer. |
| EMBEDDINGS_MODEL_NAME | The model used for preparation and execution for the sentence transformer. |
| SOURCE_DOCUMENTS_DIRECTORY | The directory on disk that stores the documents to be ingested into the Chroma database. |
| PERSIST_DIRECTORY | The directory in which to persist the Chroma database. |
| SOURCE_DOCUMENTS_MAX_COUNT | The number of sources to use when prompting Dolly with the question. |
| DATABRICKS_MODEL_NAME | The name of the Databricks Dolly model. |
| DATABRICKS_CLUSTER_ID | The identifier of the Databricks cluster to use for LLM or notebook runs. |
| DATABRICKS_NOTEBOOK_PATH | The path of the delta_buddy_run.py notebook in the DBFS of Databricks. |
| DATABRICKS_SERVER_HOSTNAME | The server hostname used to access your Databricks account. |
| DATABRICKS_TEXT_TO_SQL_MODEL | The model to use for Text to SQL translations. |
| DATABRICKS_HTTP_PATH | The HTTP path of the Databricks Warehouse used for fetching metadata. |
| DATABRICKS_TOKEN | The token of your Databricks account used to access clusters or metadata. |
| DATABRICKS_LLM_PORT | The port used to access the model's API on the delta_buddy_run.py notebook. |
| DATABRICKS_SERVING_MODE | The serving mode for accessing the model: local, notebook_hosted_api, notebook_api. |
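
A .env file is a plain list of KEY=VALUE lines. The following is a stdlib-only sketch of reading such content, assuming simple quoted or unquoted values and `#` comment lines. `parse_env` is a hypothetical helper, the sample values are placeholders, and the project itself may rely on a library such as python-dotenv instead.

```python
def parse_env(text):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Strip surrounding whitespace and one pair of matching quotes.
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        settings[key.strip()] = value
    return settings


# Placeholder values for illustration only.
SAMPLE_ENV = """
# Delta-Buddy configuration
EXECUTION_CONTEXT=local
PERSIST_DIRECTORY="db"
SOURCE_DOCUMENTS_MAX_COUNT=4
"""

if __name__ == "__main__":
    settings = parse_env(SAMPLE_ENV)
    # Values are always strings, so numeric settings need an explicit cast.
    max_sources = int(settings["SOURCE_DOCUMENTS_MAX_COUNT"])
    print(settings["EXECUTION_CONTEXT"], settings["PERSIST_DIRECTORY"], max_sources)
```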

🛡️ License

Delta-Buddy is licensed under the Apache License 2.0. See the LICENSE file for more details.

✨ Contributing

Contributions are welcome! Please check out the todos below, and feel free to open a pull request. For more information, please see the contributing guidelines.

After installing the virtual environment, please remember to install pre-commit to comply with our standards.

🗺️ Todo

  • Ask questions on Delta Lake tables with Text to SQL capabilities.
  • Add a chat history
  • Improve the CI and the tests
  • Integrate MLflow for serving the model
  • Improve the FastAPI features and documentation

Disclaimer

This is an early-stage project to validate the feasibility of a fully private solution for question answering using Dolly and vector embeddings. It is not yet production-ready.
