Synthetic Data Generation with LangChain and LLMs

This repository provides tools for generating synthetic data using either OpenAI's GPT-3.5-turbo or Llama 3 8B served through Ollama. You can use any model available in Ollama, but I only tested with Llama 3 8B in this repository.

Features:

  • Flexible Model Selection: Choose between OpenAI's GPT-3.5-turbo or Ollama's Llama 3 models for your data generation needs.
  • Customizable Data Generation: Easily modify the provided model modules to generate specific data formats and structures.
  • Scalable Generation: Specify the number of runs (data generation attempts) to control the volume of synthetic data produced.

Prerequisites:

  1. Python 3.7+: Ensure you have Python installed on your system.
  2. Virtual Environment: Create a virtual environment for managing project dependencies.
  3. Requirements: Install the necessary libraries using the provided requirements.txt file.

Installation:

  1. Clone the repository:
    git clone https://github.com/yazanrisheh/Synthetic-Data-Generation-with-LangChain
  2. Navigate to the project directory:
    cd Synthetic-Data-Generation-with-LangChain
  3. Create a virtual environment (recommended):
    python -m venv .venv
  4. Activate the virtual environment:
    source .venv/bin/activate
    (on Windows: .venv\Scripts\activate)
  5. Install project dependencies:
    pip install -r requirements.txt
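
After the steps above, a quick sanity check can confirm that the interpreter meets the stated Python 3.7+ requirement and that a virtual environment is active. This is only a minimal sketch: `check_environment` is a hypothetical helper and is not part of the repository.

```python
import sys
from typing import Dict, Tuple


def check_environment(min_version: Tuple[int, int] = (3, 7)) -> Dict[str, bool]:
    """Report whether the interpreter is new enough and runs inside a venv."""
    return {
        # Compare (major, minor) against the minimum from the prerequisites.
        "python_ok": sys.version_info[:2] >= min_version,
        # Inside a venv, sys.prefix points at the venv and differs from base_prefix.
        "in_venv": sys.prefix != sys.base_prefix,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```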

Running the Scripts:

  1. Ollama:
    python ollama_synthetic_generation.py
  2. OpenAI:
    python openai_synthetic_generation.py

Configuration:

  • Model Selection:
    • Modify the ollama_model.py or openai_model.py file to customize the data generation prompts and structure for your desired output.
  • Number of Runs:
    • Adjust the runs variable in the respective script to control the number of data generation attempts.
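
As a rough illustration of how a `runs` setting can drive generation, the sketch below loops over generation attempts and collects whatever each attempt returns. It is not the repository's actual code: `run_generation`, `generate_batch`, and `stub_batch` are hypothetical names, and the stub stands in for a real model call.

```python
from typing import Callable, List


def run_generation(generate_batch: Callable[[int], List[str]], runs: int) -> List[str]:
    """Call the generator `runs` times and collect every record it returns."""
    if runs < 0:
        raise ValueError("runs must be non-negative")
    records: List[str] = []
    for run_index in range(runs):
        # One run is one generation attempt; it may yield any number of records.
        records.extend(generate_batch(run_index))
    return records


def stub_batch(run_index: int) -> List[str]:
    """Hypothetical stand-in for a model call; returns 0, 1, or 2 records."""
    return [f"record-{run_index}-{i}" for i in range(run_index % 3)]


if __name__ == "__main__":
    runs = 5  # number of generation attempts, not number of records
    data = run_generation(stub_batch, runs)
    print(f"{runs} runs produced {len(data)} records")
```

The point of the sketch is that `runs` counts attempts, so the number of records is only known once the loop finishes.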

Example Usage:

  • Generate 100 product descriptions using OpenAI's GPT-3.5-turbo:

    • Update openai_model.py with your product description prompts.
    • Set runs = 100 in openai_synthetic_generation.py.
    • Run python openai_synthetic_generation.py.
  • Generate 50 customer reviews for a specific product using Ollama's Llama 3:

    • Modify ollama_model.py to include customer review prompts and format.
    • Set runs = 50 in ollama_synthetic_generation.py.
    • Ensure your Ollama settings are configured correctly.
    • Run python ollama_synthetic_generation.py.
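
For the customer review example, it can help to fix the target format up front and discard model output that does not match it. The sketch below assumes the model is prompted to emit one JSON object per line. The `CustomerReview` fields and the `parse_reviews` helper are hypothetical and do not reflect the contents of `ollama_model.py`.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class CustomerReview:
    """Hypothetical target format for one synthetic review."""
    product: str
    rating: int  # expected range: 1 to 5
    text: str


def parse_reviews(raw_lines: List[str]) -> List[CustomerReview]:
    """Keep only lines that are JSON objects matching the review format."""
    reviews: List[CustomerReview] = []
    for line in raw_lines:
        try:
            obj = json.loads(line)
            review = CustomerReview(
                product=str(obj["product"]),
                rating=int(obj["rating"]),
                text=str(obj["text"]),
            )
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue  # skip malformed or incomplete model output
        if 1 <= review.rating <= 5 and review.text.strip():
            reviews.append(review)
    return reviews


if __name__ == "__main__":
    sample = [
        '{"product": "Kettle", "rating": 4, "text": "Boils fast."}',
        "not json at all",
    ]
    print(parse_reviews(sample))
```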

Important Notes:

  • Output Count: The models do not always generate exactly the number of data points specified by runs, especially when using Ollama.

  • OpenAI Tests: Running 50 generations cost anywhere from $0.01 to $0.07, taking around 64 seconds each time.

  • Ollama Performance: Running 100 generations using the Llama 3 model produced 571 outputs in 8299 seconds (2.3 hours).
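
The figures above can be turned into per-unit numbers for planning a larger job. The sketch below only does arithmetic on the reported measurements. The helper names are hypothetical, and the projection assumes that cost scales linearly with the number of generations, which the measurements do not confirm.

```python
from typing import Dict, Tuple


def ollama_summary(total_seconds: float, runs: int, outputs: int) -> Dict[str, float]:
    """Derive per-run and per-output figures from one measured Ollama job."""
    if runs <= 0 or outputs <= 0:
        raise ValueError("runs and outputs must be positive")
    return {
        "hours": total_seconds / 3600,
        "seconds_per_run": total_seconds / runs,
        "seconds_per_output": total_seconds / outputs,
        "outputs_per_run": outputs / runs,
    }


def openai_cost_range(low: float, high: float, measured: int, planned: int) -> Tuple[float, float]:
    """Scale a measured cost range to a planned job size (linear assumption)."""
    if measured <= 0:
        raise ValueError("measured must be positive")
    factor = planned / measured
    return low * factor, high * factor


if __name__ == "__main__":
    # Reported: 100 runs, 571 outputs, 8299 seconds with Llama 3 on Ollama.
    for key, value in ollama_summary(8299, 100, 571).items():
        print(f"{key}: {value:.2f}")
    # Reported: 50 generations cost $0.01 to $0.07 with OpenAI.
    low, high = openai_cost_range(0.01, 0.07, measured=50, planned=1000)
    print(f"Projected cost for 1000 generations: ${low:.2f} to ${high:.2f}")
```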

Contributors:

  • yazanrisheh
