
gpt-doc2data

[ English | δΈ­ζ–‡ ]

You are welcome to join our WeChat group chat!

When collecting data for LLM fine-tuning, obtaining formatted data from existing documents can be costly. Fortunately, gpt-doc2data comes to the rescue! It is a GPT-based data generator: feed it your private documents, and a large language model will create high-quality question-answer pairs, which you can then use for fine-tuning or prompt-tuning your own model.
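
At its core the pipeline is straightforward: split each document into chunks that fit the model's context window, ask the model for QA pairs in JSON, and collect the parsed results. The sketch below illustrates this idea with the OpenAI Python SDK; the function names, prompt wording, and chunking strategy are illustrative assumptions, not the repository's actual implementation.

# Minimal sketch of the document -> QA-pair idea (illustrative, not the repo's code).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Generate {n} question-answer pairs about the following text. "
    "Respond only with a JSON array of objects with 'question' and 'answer' keys.\n\n{text}"
)

def chunk(text: str, size: int = 3000) -> list[str]:
    # Naive fixed-size character chunking; the real tool may split more carefully.
    return [text[i:i + size] for i in range(0, len(text), size)]

def generate_qa(text: str, pairs_per_chunk: int = 5) -> list[dict]:
    qa_pairs = []
    for piece in chunk(text):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user",
                       "content": PROMPT.format(n=pairs_per_chunk, text=piece)}],
        )
        # json.loads will raise if the model strays from pure JSON output.
        qa_pairs.extend(json.loads(resp.choices[0].message.content))
    return qa_pairs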

Example

Below are 10 QA pairs generated after feeding the paper LoRA: Low-Rank Adaptation of Large Language Models into the system:

[
    {
        "question": "What is LoRA?",
        "answer": "LoRA is a method that freezes pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture to reduce the number of trainable parameters."
    },
    {
        "question": "What are the advantages of LoRA?",
        "answer": "The advantages of LoRA include reduced number of trainable parameters, lower GPU memory requirement, higher training throughput, no additional inference latency, and the ability to switch tasks efficiently."
    },
    {
        "question": "How does LoRA compare to fine-tuning in terms of model quality?",
        "answer": "LoRA performs on-par or better than fine-tuning in model quality on various language models, despite having fewer trainable parameters and higher training throughput."
    },
    {
        "question": "Which weight matrices in the Transformer architecture should be adapted with LoRA?",
        "answer": "LoRA should be applied to the weight matrices in the self-attention module, specifically Wq and Wv, for optimal performance."
    },
    {
        "question": "What is the optimal rank for LoRA?",
        "answer": "A low rank, such as 1 or 2, is sufficient for LoRA to achieve competitive performance on downstream tasks."
    },
    {
        "question": "What is the advantage of few-shot learning?",
        "answer": "Few-shot learning is advantageous when we only have a handful of training samples."
    },
    {
        "question": "What is the difference between adapter layers and LoRA?",
        "answer": "Adapter layers are computed in addition to the base model, introducing additional latency, while LoRA is added in a parallel manner."
    },
    {
        "question": "What is the GLUE Benchmark?",
        "answer": "The GLUE Benchmark is a collection of natural language understanding tasks used to evaluate NLU models."
    },
    {
        "question": "What is the purpose of the E2E NLG Challenge dataset?",
        "answer": "The E2E NLG Challenge dataset is used for training end-to-end, data-driven natural language generation systems."
    },
    {
        "question": "What is the amplification factor for task-specific directions in LoRA?",
        "answer": "The amplification factor for task-specific directions in LoRA is around 20."
    }
]

Getting Started

Install requirements

git clone https://github.com/codewangg/gpt-doc2data.git
cd gpt-doc2data
pip install -r requirements.txt

Prepare your documents

Currently supported file formats:

  • PDF
  • Markdown
  • TXT

All files should be placed under the gpt-doc2data/data directory.
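
A minimal loader for these three formats might look like the sketch below. It assumes the pypdf package for PDF text extraction and treats Markdown as plain text; the repository's actual implementation may differ.

# Sketch: load all supported files from the data directory (illustrative only).
from pathlib import Path
from pypdf import PdfReader  # assumed PDF dependency; the repo may use another

def load_documents(data_dir: str = "data") -> dict[str, str]:
    docs = {}
    for path in Path(data_dir).iterdir():
        suffix = path.suffix.lower()
        if suffix == ".pdf":
            reader = PdfReader(str(path))
            docs[path.name] = "\n".join(
                page.extract_text() or "" for page in reader.pages
            )
        elif suffix in (".md", ".txt"):
            docs[path.name] = path.read_text(encoding="utf-8")
    return docs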

config.yaml

Rename example_config.yaml to config.yaml, modify it to suit your requirements, and provide your own OpenAI API key.
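
Reading the config from Python is a one-liner with PyYAML; in the sketch below, the key names are hypothetical, so check example_config.yaml for the real ones.

# Sketch: load settings from config.yaml (key names are hypothetical).
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

api_key = cfg["openai_api_key"]          # hypothetical key name
data_dir = cfg.get("data_dir", "data")   # hypothetical key name with default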

Generate QA pairs

python3 gpt-doc2data/gpt-doc2data.py

TODO

Low-hanging Fruits

  • Add an "id" field in the output JSON.
  • Improve the README.md for better understanding and usage.
  • Add a Chinese README page.
  • Clean up and add comments and type specifiers to the codebase (currently over 50% generated by LLM).

Medium-hanging Fruits

  • Improve the estimate of how many tokens the generated QA pairs will need, since the current approach may waste tokens on each API call; one possible counting approach is sketched after this list.
  • Add support for configuring the output JSON key names.
  • Add a rate limiter to avoid overloading the OpenAI API.
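
One way to estimate token usage before calling the API is the tiktoken library; this is a suggestion for the TODO item above, not the repository's current method.

# Sketch: count prompt tokens up front with tiktoken (suggested approach only).
import tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

# Size chunks so prompt + expected completion fit the context window, e.g.:
# max_chunk_tokens = context_window - count_tokens(prompt_template) - completion_budget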

High-hanging Fruits

  • Integrate the tool with locally or privately served open-source models to reduce the cost of high-throughput OpenAI API usage.
  • Extend support to more file types, such as audio and video, to serve as additional information sources.
  • Broaden the tool's output beyond QA pairs to other formats useful for fine-tuning.
  • Implement a human judge mechanism to ensure high-quality data generation when needed.
