
instantqna's Introduction

InstantQnA

Introducing the Instant QnA builder, a tool that lets you quickly create searchable QnA systems from PDF files. It uses OpenAI embedding models to generate search embeddings for your documents, making it easy to find the information you need.

How it works

  1. Install the project's dependencies:

    Windows:

    pip install -r requirements.txt
    

    Unix:

    python3 -m pip install -r requirements.txt
    
  2. Update constants.py with your OpenAI API token:

    token="<YOUR-OPENAI-API-TOKEN>"
  3. Place the PDFs that you want to search inside the /sources directory

  4. Run the program

    Windows:

    python main.py

    Unix:

    python3 main.py

    You will be shown an estimated cost to embed all of the files and asked to confirm (y/n). Choose y to proceed. By default the engine uses text-embedding-ada-002, which is inexpensive and performant. You can update the code to embed with other models, such as davinci.

  5. Once all of the files are fully processed and embedded, the program prompts you to enter a search query. If there are matching results, it returns the top 3 results with their scores and source file names.
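
The cost prompt in step 4 can be sketched roughly as follows. This is a minimal illustration, not the project's own code: the names `estimate_cost` and `confirm` are hypothetical, and the rate of 0.0004 USD per 1,000 tokens is a placeholder assumption, so check OpenAI's current pricing before relying on the figure.

```python
def estimate_cost(total_tokens: int, usd_per_1k_tokens: float = 0.0004) -> float:
    """Return the approximate embedding cost in USD.

    usd_per_1k_tokens is a placeholder rate (an assumption), not an
    authoritative price.
    """
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    return total_tokens / 1000 * usd_per_1k_tokens


def confirm(answer: str) -> bool:
    """Interpret a y/n reply; anything other than 'y' or 'yes' declines."""
    return answer.strip().lower() in {"y", "yes"}


# Hypothetical total token count across all dataset files.
tokens = 250_000
print(f"{tokens} tokens in total (approx. ${estimate_cost(tokens):.2f})")
```

In an interactive run, the reply read from the y/n prompt would be passed to `confirm` before any embedding request is made.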

Usage

If you have PDF files from which you want to build a question and answer engine, this tool should be useful for you.

Upload PDF file

To begin, select the PDF files that you want to create a QnA system for and place them in the /sources directory.

read_source.py

This Python file reads all of the PDF files from /sources and then writes their text content to /ai_generated/dumps.
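
A minimal sketch of this dump step, under stated assumptions: `dump_sources` is a hypothetical name, the `.txt` output naming is a guess, and the PDF text extractor is passed in as a function so the sketch does not depend on any particular PDF library.

```python
from pathlib import Path
from typing import Callable


def dump_sources(
    source_dir: Path,
    dump_dir: Path,
    extract_text: Callable[[Path], str],
) -> list[Path]:
    """Write the text of every PDF in source_dir to a file in dump_dir.

    extract_text is supplied by the caller (for example, a wrapper around
    a PDF library); the "<name>.pdf.txt" output name is an assumption.
    """
    dump_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(source_dir.glob("*.pdf")):
        text = extract_text(pdf_path)
        out_path = dump_dir / (pdf_path.name + ".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

Passing the extractor as a parameter also makes the directory handling easy to test with a stub.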

get_file_data.py

Goes through all files in /sources and collects those that have not been embedded yet, or whose embedding has expired.
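
One way this check could look is sketched below. The function name is hypothetical, the `<name>.pdf.json` embedding path is inferred from the log output in the issue report, and "expired" is assumed here to mean that the embedding file is older than its source PDF. The project may define expiry differently.

```python
from pathlib import Path


def files_needing_embedding(source_dir: Path, embeds_dir: Path) -> list[Path]:
    """Return PDFs with no embedding file, or with one older than the PDF."""
    pending = []
    for pdf_path in sorted(source_dir.glob("*.pdf")):
        # Assumed layout: embeds/<pdf file name>.json
        embed_path = embeds_dir / (pdf_path.name + ".json")
        if not embed_path.exists():
            pending.append(pdf_path)  # never embedded
        elif embed_path.stat().st_mtime < pdf_path.stat().st_mtime:
            pending.append(pdf_path)  # assumed "expired": PDF changed since
    return pending
```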

Generate search embeddings

Once the file is uploaded, generate search embeddings for the contents of the PDF. This process may take a few minutes, depending on the size of the file.

create_dataset.py

Parses the text content of each PDF and groups it into coherent paragraphs of no more than 1000 tokens. The dataset is then saved in CSV format, which gives an AI model a structured and readable input to process.
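
The grouping step might be sketched as below. Everything here is illustrative: the function names and CSV column names are hypothetical, and tokens are approximated by whitespace-separated words, whereas a real run would count tokens with the tokenizer that matches the embedding model.

```python
import csv
from pathlib import Path


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def split_oversized(paragraph: str, max_tokens: int) -> list[str]:
    """Split one paragraph into pieces of at most max_tokens words."""
    words = paragraph.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def group_paragraphs(text: str, max_tokens: int = 1000) -> list[str]:
    """Greedily merge blank-line-separated paragraphs into chunks <= max_tokens."""
    chunks, current, current_tokens = [], [], 0
    for paragraph in text.split("\n\n"):
        for piece in split_oversized(paragraph, max_tokens):
            n = count_tokens(piece)
            if current and current_tokens + n > max_tokens:
                chunks.append("\n\n".join(current))
                current, current_tokens = [], 0
            current.append(piece)
            current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def save_dataset(chunks: list[str], csv_path: Path) -> None:
    """Write one row per chunk; the column names are assumptions."""
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["index", "text", "tokens"])
        for i, chunk in enumerate(chunks):
            writer.writerow([i, chunk, count_tokens(chunk)])
```

Greedy merging keeps neighbouring paragraphs together where the budget allows, and a paragraph that alone exceeds the budget is split so that no chunk goes over the limit.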

embed.py

This file creates the text embeddings using the OpenAI Ada model (you can switch to any other embedding model) and also provides the search/query functions.

Execute search queries

You can now execute search queries to find the information you need. Enter your query in the search box and the tool will return any matching results from the PDF.
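
As a rough illustration of how a top-3 lookup over stored embeddings can work, the sketch below ranks records by cosine similarity to the query vector. The use of cosine similarity, the record layout, and the function names are all assumptions rather than the project's confirmed implementation, and the query embedding is taken as already computed.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_vec, records, k=3):
    """Return the k best (score, source_file, text) tuples, highest score first.

    records is an iterable of (source_file, text, embedding) tuples
    (a hypothetical layout).
    """
    scored = [
        (cosine_similarity(query_vec, embedding), source_file, text)
        for source_file, text, embedding in records
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```

Each returned tuple carries the score and the source file name, which matches what the program reports for a query.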

main.py

The entry point from which you run the project.

instantqna's People

Contributors

raghavan

instantqna's Issues

Getting Invalid API token

I was trying to run this code but I am getting the error below:

➜ InstantQnA git:(main) ✗ python3 main.py

                                    Instant Q&A Builder                                         

ai_generated/embeds/bill (2).pdf.json EMBED NOT FOUND

Reading pdf text content: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 308.78it/s]

Creating dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 89.60it/s]

Calculating total token: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 294.01it/s]

0 tokens in total, (approx. $0.0)
Would you like to embed? (y/n)y
Embedding files: 0%| | 0/1 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Invalid API token
Embedding files: 100%|█
